• Introduction
  • Packages
  • Exploratory Data Analysis
    • Initial Data Analysis
    • Feature Engineering/Cleaning
    • Visualization
    • Outliers
      • Numeric Review
      • Categorical Review
      • Outlier Detection Model
  • Data Preprocessing
    • Split Dataset
    • Train/Test sets
    • Transform Categorical and Numerical features
  • Metrics Definitions
  • Model Training
    • Logistic Regression
      • Base Model
      • Cross Validation with Parameters
      • Metrics and Feature Importance
    • Support Vector Machine
      • Base Model
      • Cross Validation with Parameters
      • Test Set
    • Random Forest
      • Base Model
      • Cross Validation with Parameters
      • Test Set
      • Feature Importance
    • Gradient Boosting
      • Base Model
      • Cross Validation with Parameters
      • Metrics and Feature Importance
    • Extreme Gradient Boosting
      • Base Model
      • Cross Validation with Parameters
      • Test Set
      • Feature Importance
    • Model Comparison
  • New Data
    • Feature Engineering
    • Data Preparation
    • Predict and Evaluate


Introduction

Insurance fraud poses a concern across various sectors, including healthcare, homeownership, and automobile coverage. Its impact extends beyond the financial burden on insurers, affecting even non-fraudulent policyholders.

This analysis zeroes in on fraud within the auto insurance industry in India. The dataset utilized for this project was sourced from Kaggle (https://www.kaggle.com/).

Our objective is to leverage classification models to predict fraudulent auto insurance claims. Various classification models will be evaluated based on their efficacy in accurately predicting instances of actual fraud.

Readers may not be interested in every section of this analysis. Specific sections can be accessed directly through the table of contents on the left. For instance, clicking on “Model Training” takes you straight to the classification models section.

The following programs were used for this project.

Python 3.10.10

R 4.2.2 (Specific Visualizations)



Packages




import pandas as pd

import pickle

from pandas.api.types import is_numeric_dtype
from pandas.api.types import is_categorical_dtype

import datetime

from datetime import date

from dateutil.relativedelta import relativedelta

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

import plotly.io as pio

import matplotlib.patches as mpatches

from plotnine import *

import plotnine

import scipy
from scipy import stats


from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder


from sklearn.compose import ColumnTransformer, make_column_transformer

from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.model_selection import StratifiedKFold

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.compose import make_column_selector as selector

from sklearn.preprocessing import OneHotEncoder


from sklearn import set_config

from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay, accuracy_score, roc_auc_score, recall_score, RocCurveDisplay, precision_score, f1_score, make_scorer

from sklearn import metrics

from sklearn.inspection import permutation_importance

from sklearn.model_selection import cross_val_score,StratifiedKFold


import time

import sys
import contextlib

from io import StringIO
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from mlxtend.frequent_patterns import apriori
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier


Exploratory Data Analysis



Initial Data Analysis



The data was downloaded as five individual data sets. We will review each data set for suitability of being merged into one data set.




## ************Train_Claim_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 19 columns):
##  #   Column                 Non-Null Count  Dtype 
## ---  ------                 --------------  ----- 
##  0   CustomerID             28836 non-null  object
##  1   DateOfIncident         28836 non-null  object
##  2   TypeOfIncident         28836 non-null  object
##  3   TypeOfCollission       28836 non-null  object
##  4   SeverityOfIncident     28836 non-null  object
##  5   AuthoritiesContacted   28836 non-null  object
##  6   IncidentState          28836 non-null  object
##  7   IncidentCity           28836 non-null  object
##  8   IncidentAddress        28836 non-null  object
##  9   IncidentTime           28836 non-null  int32 
##  10  NumberOfVehicles       28836 non-null  int32 
##  11  PropertyDamage         28836 non-null  object
##  12  BodilyInjuries         28836 non-null  int32 
##  13  Witnesses              28836 non-null  object
##  14  PoliceReport           28836 non-null  object
##  15  AmountOfInjuryClaim    28836 non-null  int32 
##  16  AmountOfPropertyClaim  28836 non-null  int32 
##  17  AmountOfVehicleDamage  28836 non-null  int32 
##  18  AmountOfTotalClaim     28836 non-null  int32 
## dtypes: int32(7), object(12)
## memory usage: 3.4+ MB



## ************Train_Policy_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
##  #   Column                      Non-Null Count  Dtype  
## ---  ------                      --------------  -----  
##  0   InsurancePolicyNumber       28836 non-null  int32  
##  1   CustomerLoyaltyPeriod       28836 non-null  int32  
##  2   DateOfPolicyCoverage        28836 non-null  object 
##  3   InsurancePolicyState        28836 non-null  object 
##  4   Policy_CombinedSingleLimit  28836 non-null  object 
##  5   Policy_Deductible           28836 non-null  int32  
##  6   PolicyAnnualPremium         28836 non-null  float64
##  7   UmbrellaLimit               28836 non-null  int32  
##  8   InsuredRelationship         28836 non-null  object 
##  9   CustomerID                  28836 non-null  object 
## dtypes: float64(1), int32(4), object(5)
## memory usage: 1.8+ MB



## ************Train_Demographics_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
##  #   Column                 Non-Null Count  Dtype 
## ---  ------                 --------------  ----- 
##  0   CustomerID             28836 non-null  object
##  1   InsuredAge             28836 non-null  int32 
##  2   InsuredZipCode         28836 non-null  int32 
##  3   InsuredGender          28836 non-null  object
##  4   InsuredEducationLevel  28836 non-null  object
##  5   InsuredOccupation      28836 non-null  object
##  6   InsuredHobbies         28836 non-null  object
##  7   CapitalGains           28836 non-null  int32 
##  8   CapitalLoss            28836 non-null  int32 
##  9   Country                28836 non-null  object
## dtypes: int32(4), object(6)
## memory usage: 1.8+ MB



## **********Traindata_with_Target_p Information**********
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 2 columns):
##  #   Column         Non-Null Count  Dtype 
## ---  ------         --------------  ----- 
##  0   CustomerID     28836 non-null  object
##  1   ReportedFraud  28836 non-null  object
## dtypes: object(2)
## memory usage: 450.7+ KB



## ************Train_Vehicle_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 115344 entries, 0 to 115343
## Data columns (total 3 columns):
##  #   Column                   Non-Null Count   Dtype 
## ---  ------                   --------------   ----- 
##  0   CustomerID               115344 non-null  object
##  1   VehicleAttribute         115344 non-null  object
##  2   VehicleAttributeDetails  115344 non-null  object
## dtypes: object(3)
## memory usage: 2.6+ MB



## *************Train_Vehicle_p First 25 Rows*************
##    CustomerID VehicleAttribute VehicleAttributeDetails
## 0   Cust20179        VehicleID             Vehicle8898
## 1   Cust21384     VehicleModel                  Malibu
## 2   Cust33335      VehicleMake                  Toyota
## 3   Cust27118     VehicleModel                    Neon
## 4   Cust13038        VehicleID            Vehicle30212
## 5    Cust1801        VehicleID            Vehicle24096
## 6   Cust30237     VehicleModel                     RAM
## 7   Cust21334       VehicleYOM                    1996
## 8   Cust26634       VehicleYOM                    1999
## 9   Cust20624      VehicleMake               Chevrolet
## 10  Cust14947        VehicleID            Vehicle15216
## 11  Cust21432       VehicleYOM                    2002
## 12  Cust22845       VehicleYOM                    2000
## 13   Cust9006      VehicleMake                  Accura
## 14  Cust30659       VehicleYOM                    2003
## 15  Cust18447      VehicleMake                   Honda
## 16  Cust19144        VehicleID            Vehicle29018
## 17  Cust26846        VehicleID            Vehicle21867
## 18   Cust4801       VehicleYOM                    1998
## 19  Cust18081       VehicleYOM                    2013
## 20  Cust17021      VehicleMake                     BMW
## 21  Cust30660       VehicleYOM                    2002
## 22  Cust22099        VehicleID            Vehicle30877
## 23  Cust33560       VehicleYOM                    2011
## 24  Cust17371       VehicleYOM                    2001





The data sets train claim, train policy, train demographics, and train with target are ready to be merged into one data set.

Viewing the first twenty-five rows of the Train Vehicle data, we can see that the VehicleAttribute column has repeating rows, as each CustomerID is associated with a Vehicle Model, Vehicle Make, Vehicle ID, and Vehicle YOM. The data set has 115344 rows, four times as many as the other data sets, so it must be reshaped before it can be merged with them. Each level of VehicleAttribute should become an individual feature holding its corresponding value from the VehicleAttributeDetails feature. This will be accomplished by pivoting the Train Vehicle data set wider: spreading the VehicleAttribute levels into columns creates a new data set that is shorter and wider.
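On toy data (not the project data), this long-to-wide reshape looks as follows; the customer IDs and attribute values here are made up for illustration:

```python
import pandas as pd

# Long format: one row per (customer, attribute) pair
long_df = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C2"],
    "VehicleAttribute": ["VehicleMake", "VehicleYOM", "VehicleMake", "VehicleYOM"],
    "VehicleAttributeDetails": ["Toyota", "2008", "Honda", "2011"],
})

# Each attribute level becomes its own column, one row per customer
wide_df = long_df.pivot(index="CustomerID",
                        columns="VehicleAttribute",
                        values="VehicleAttributeDetails").reset_index()

print(wide_df)
```

The toy frame shrinks from four rows to two, with one column per attribute level, mirroring the 115344-to-28836 reduction below.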




train_vehicle_wide = Train_Vehicle_p.pivot(
    index='CustomerID',
    columns='VehicleAttribute',
    values='VehicleAttributeDetails'
).reset_index()



## ************train_vehicle_wide Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 5 columns):
##  #   Column        Non-Null Count  Dtype 
## ---  ------        --------------  ----- 
##  0   CustomerID    28836 non-null  object
##  1   VehicleID     28836 non-null  object
##  2   VehicleMake   28836 non-null  object
##  3   VehicleModel  28836 non-null  object
##  4   VehicleYOM    28836 non-null  object
## dtypes: object(5)
## memory usage: 1.1+ MB



## *************train_vehicle_wide first 50 rows*************
## VehicleAttribute CustomerID     VehicleID VehicleMake VehicleModel VehicleYOM
## 0                 Cust10000  Vehicle26917        Audi           A5       2008
## 1                 Cust10001  Vehicle15893        Audi           A5       2006
## 2                 Cust10002   Vehicle5152  Volkswagen        Jetta       1999
## 3                 Cust10003  Vehicle37363  Volkswagen        Jetta       2003
## 4                 Cust10004  Vehicle28633      Toyota          CRV       2010
## 5                 Cust10005  Vehicle26409      Toyota          CRV       2011
## 6                 Cust10006  Vehicle12114    Mercedes         C300       2000
## 7                 Cust10007  Vehicle26987      Suburu         C300       2010
## 8                 Cust10009  Vehicle12490  Volkswagen       Passat       1995
## 9                  Cust1001  Vehicle28516        Saab          92x       2004
## 10                Cust10011   Vehicle8940      Nissan       Ultima       2002
## 11                Cust10012   Vehicle9379        Ford       Fusion       2004
## 12                Cust10013  Vehicle22024      Accura       Fusion       2001
## 13                Cust10014   Vehicle3601      Suburu      Impreza       2011
## 14                Cust10016   Vehicle7515        Saab          92x       2005
## 15                Cust10017  Vehicle31838        Saab          92x       2005
## 16                Cust10018  Vehicle35954      Toyota           93       2000
## 17                Cust10019  Vehicle19647        Saab           93       2000
## 18                Cust10021  Vehicle37694  Volkswagen       Passat       2006
## 19                Cust10022  Vehicle31889      Toyota   Highlander       1997
## 20                Cust10023  Vehicle10464      Toyota   Highlander       1999
## 21                Cust10024  Vehicle24452       Dodge           X5       2001
## 22                Cust10025  Vehicle12734       Dodge           X5       2002
## 23                Cust10026  Vehicle14492  Volkswagen       Passat       2001
## 24                Cust10027  Vehicle38970        Saab       Passat       1995
## 25                Cust10028   Vehicle3996       Honda       Accord       2015
## 26                Cust10029  Vehicle12477      Toyota      Corolla       2015
## 27                Cust10030  Vehicle34293        Ford    Forrestor       2006
## 28                Cust10031  Vehicle33775      Suburu         F150       2005
## 29                Cust10032  Vehicle34708      Nissan   Pathfinder       2012
## 30                Cust10034  Vehicle26030        Saab          92x       2006
## 31                Cust10035   Vehicle3961        Saab        Jetta       2007
## 32                Cust10037  Vehicle38667       Dodge         Neon       2012
## 33                 Cust1004  Vehicle17051   Chevrolet        Tahoe       2014
## 34                Cust10040   Vehicle7284        Audi     Wrangler       2007
## 35                Cust10041   Vehicle2119        Jeep           A3       2008
## 36                Cust10042   Vehicle7459      Accura           A5       1997
## 37                Cust10043   Vehicle6244      Accura          RSX       2010
## 38                Cust10044  Vehicle38446   Chevrolet       Malibu       1998
## 39                Cust10046   Vehicle3199        Audi           A5       2011
## 40                Cust10047  Vehicle13780        Audi           A5       2009
## 41                Cust10049  Vehicle35318        Ford         F150       2008
## 42                 Cust1005  Vehicle26158      Accura          RSX       2009
## 43                Cust10051  Vehicle33864       Dodge         E400       2014
## 44                Cust10052  Vehicle16314       Honda       Legacy       2002
## 45                Cust10053  Vehicle35570      Suburu       Legacy       2000
## 46                Cust10054  Vehicle13054        Audi       Ultima       2006
## 47                Cust10057  Vehicle23410      Suburu       Legacy       2005
## 48                Cust10058  Vehicle24044         BMW          92x       2005
## 49                Cust10059  Vehicle25575         BMW           X5       2006



We have taken the data from Train Vehicle and created a new data set called train_vehicle_wide. This new data set has four new columns and 28836 rows, which now matches the other four data sets. We are ready to merge all data sets.





fraud=Train_Claim_p.merge(Train_Demographics_p, on="CustomerID")\
.merge(Train_Policy_p, on="CustomerID")\
.merge(train_vehicle_wide, on="CustomerID")\
.merge(Traindata_with_Target_p, on="CustomerID")



We’ll now test to ensure our data joins and transformations have returned a dataframe.





# Function to check that an object is a pandas DataFrame
def check_is_dataframe(df):
    assert isinstance(df, pd.DataFrame), "Error: object is not a DataFrame."
    print("Object is a DataFrame")


check_is_dataframe(fraud)
## Object is a DataFrame



## *******************fraud Data Types*******************
## CustomerID                     object
## DateOfIncident                 object
## TypeOfIncident                 object
## TypeOfCollission               object
## SeverityOfIncident             object
## AuthoritiesContacted           object
## IncidentState                  object
## IncidentCity                   object
## IncidentAddress                object
## IncidentTime                    int32
## NumberOfVehicles                int32
## PropertyDamage                 object
## BodilyInjuries                  int32
## Witnesses                      object
## PoliceReport                   object
## AmountOfInjuryClaim             int32
## AmountOfPropertyClaim           int32
## AmountOfVehicleDamage           int32
## AmountOfTotalClaim              int32
## InsuredAge                      int32
## InsuredZipCode                  int32
## InsuredGender                  object
## InsuredEducationLevel          object
## InsuredOccupation              object
## InsuredHobbies                 object
## CapitalGains                    int32
## CapitalLoss                     int32
## Country                        object
## InsurancePolicyNumber           int32
## CustomerLoyaltyPeriod           int32
## DateOfPolicyCoverage           object
## InsurancePolicyState           object
## Policy_CombinedSingleLimit     object
## Policy_Deductible               int32
## PolicyAnnualPremium           float64
## UmbrellaLimit                   int32
## InsuredRelationship            object
## VehicleID                      object
## VehicleMake                    object
## VehicleModel                   object
## VehicleYOM                     object
## ReportedFraud                  object
## dtype: object



Feature Engineering/Cleaning





Feature engineering encompasses several essential steps.

Firstly, there is feature creation, where new variables are generated from existing features to enhance both our model and data visualization.

Secondly, feature transformation involves converting features from one representation to another. For instance, we might transform a numerical feature into a categorical type.

Cleaning is a crucial process that entails scrutinizing the features. If something appears amiss with a feature, we can address the issue by either eliminating the problematic values or, in some cases, entirely removing the feature. Null values, for instance, can be handled by replacing them with alternative values, removing data points with null values, or, as previously mentioned, excluding the entire feature.
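The three null-handling options above can be sketched on a toy frame (the column names and values here are illustrative only):

```python
import pandas as pd
import numpy as np

# Toy frame with one null value
df = pd.DataFrame({"age": [34, np.nan, 51], "city": ["Pune", "Delhi", "Mumbai"]})

# Option 1: replace nulls with an alternative value (here, the median)
filled = df.assign(age=df["age"].fillna(df["age"].median()))

# Option 2: remove data points (rows) containing nulls
dropped_rows = df.dropna()

# Option 3: exclude the problematic feature entirely
dropped_col = df.drop(columns=["age"])

print(filled["age"].tolist(), dropped_rows.shape, dropped_col.columns.tolist())
```

Which option is appropriate depends on how much data would be lost and whether the feature carries signal worth keeping.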



There are features that are dates though they do not have the correct data type. We will create a function to transform these features to a datetime data type.




def convert_to_datetime(df, column_name):
    df[column_name] = pd.to_datetime(df[column_name])


# Work on a copy of the merged data set
fraud_v2 = fraud.copy()

convert_to_datetime(fraud_v2, 'DateOfIncident')

convert_to_datetime(fraud_v2, 'DateOfPolicyCoverage')



We’ll now write a function to confirm that the features have been successfully transformed to a datetime data type.






def check_is_datetime(df, column_name):
    assert pd.api.types.is_datetime64_any_dtype(df[column_name]), f"Error: feature '{column_name}' is not datetime dtype."
    print(f"Feature '{column_name}' is datetime dtype")


check_is_datetime(fraud_v2,'DateOfIncident')
## Feature 'DateOfIncident' is datetime dtype


check_is_datetime(fraud_v2,'DateOfPolicyCoverage')
## Feature 'DateOfPolicyCoverage' is datetime dtype


Now that the features have been transformed to the correct data type, we will use them to create new features.





fraud_v2["coverageIncidentDiff"]=(fraud_v2["DateOfIncident"]-fraud_v2["DateOfPolicyCoverage"])

fraud_v2["coverageIncidentDiff"]=fraud_v2["coverageIncidentDiff"]/np.timedelta64(1,'Y')



## ************CoverageIncidentDiff************
## count    28836.000000
## mean        13.074582
## std          6.560420
## min         -0.054758
## 25%          7.646290
## 50%         13.172071
## 75%         18.617768
## max         25.123035
## Name: coverageIncidentDiff, dtype: float64
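Dividing a timedelta by `np.timedelta64(1, 'Y')` works on the pandas/NumPy versions used here, but newer pandas releases may reject 'Y' as a timedelta unit. An equivalent sketch using day counts, assuming an average year of 365.25 days (the dates below are toy values, not the project data):

```python
import pandas as pd

# Toy coverage and incident dates
coverage = pd.to_datetime(pd.Series(["2000-01-15", "2010-06-01"]))
incident = pd.to_datetime(pd.Series(["2013-01-15", "2015-06-01"]))

# Difference in years via day counts; avoids the 'Y' timedelta unit
diff_years = (incident - coverage).dt.days / 365.25

print(diff_years.round(2).tolist())
```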





fraud_v2['dayOfWeek'] = fraud_v2["DateOfIncident"].dt.day_name()


## *****dayOfWeek Value Counts*****
## Friday       0.15
## Tuesday      0.15
## Thursday     0.14
## Saturday     0.14
## Wednesday    0.14
## Monday       0.14
## Sunday       0.14
## Name: dayOfWeek, dtype: float64



Certain features are numeric yet may serve our models better as categorical. This can be assessed by checking the unique values of these features.



## ******** Unique Number of Vehicles********
## [3 1 4 2]


## ******** Unique Bodily Injuries********
## [1 2 0]
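This kind of cardinality check can be sketched on toy data (the values below are illustrative, not drawn from the fraud data set):

```python
import pandas as pd

# Toy frame: a low-cardinality numeric column next to a high-cardinality one
df = pd.DataFrame({
    "NumberOfVehicles": [3, 1, 4, 2, 1, 3],
    "AmountOfTotalClaim": [5200, 8100, 300, 4400, 9900, 750],
})

# Few distinct values -> candidate for category dtype
low_card = df["NumberOfVehicles"].nunique()

# Every value distinct -> keep numeric
high_card = df["AmountOfTotalClaim"].nunique()

print(sorted(df["NumberOfVehicles"].unique()), low_card, high_card)
```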



The above outputs indicate that both NumberOfVehicles and BodilyInjuries would be best as type categorical. We will create a function that converts numerical data types to categorical, then apply it to the selected numerical features.






def convert_to_cat(df, column_name):
  df[column_name]=df[column_name].astype('category')



convert_to_cat(fraud_v2, 'NumberOfVehicles') 




convert_to_cat(fraud_v2, 'BodilyInjuries') 

We use a function to confirm the two features have been transformed to a categorical data type.



def check_is_categorical(df, column_name):
    assert pd.api.types.is_categorical_dtype(df[column_name]), f"Error: feature '{column_name}' is not categorical dtype."
    print(f"Feature '{column_name}' is categorical dtype")


check_is_categorical(fraud_v2,'BodilyInjuries')
## Feature 'BodilyInjuries' is categorical dtype


check_is_categorical(fraud_v2,'NumberOfVehicles')
## Feature 'NumberOfVehicles' is categorical dtype


Both features are now of type category.



## *************Incident Time Unique Values*************
## [17 10 22  7 20 18  3  5 14 16 15 13 12  9 19  4 11  1  8  0  6 21 23  2
##  -5]


IncidentTime has enough unique values to warrant becoming categorical, though that many levels would not be optimal for our modeling. We can remedy this by placing the hour values into bins using a Python dictionary, which reduces the number of levels. Note that the invalid value -5 and hour 0 have no entry in the mapping below, so those rows will become missing values.





time_day = {
    5: 'early morning', 6: 'early morning', 7: 'early morning', 8: 'early morning',
    9: 'late morning', 10: 'late morning', 11: 'late morning',
    12: 'early afternoon', 13: 'early afternoon', 14: 'early afternoon', 15: 'early afternoon',
    16: 'late afternoon', 17: 'late afternoon',
    18: 'evening', 19: 'evening',
    20: 'night', 21: 'night', 22: 'night', 23: 'night', 24: 'night',
    1: 'night', 2: 'night', 3: 'night', 4: 'night'
}





fraud_v2['IncidentPeriodDay']=fraud_v2['IncidentTime'].map(time_day)


## ***Incident Period Day Value Counts***
## night              7458
## early afternoon    5785
## early morning      5580
## late morning       3661
## late afternoon     3231
## evening            2699
## Name: IncidentPeriodDay, dtype: int64



We find from the value count output for the new feature IncidentPeriodDay that incident times have been placed into six unique periods of the day.






fraud_v3=fraud_v2.copy()

The date and time features used in creating the new features are no longer required and will be removed from the data set.




fraud_v3=fraud_v3.drop(['DateOfIncident', 'DateOfPolicyCoverage', 'IncidentTime'], axis=1)


print("fraud_v2 data frame includes object data types:",
      any(pd.api.types.is_object_dtype(fraud_v2[col]) for col in fraud_v2.columns))
## fraud_v2 data frame includes object data types: True


For purposes of classification algorithms and visualizations we’ll need to convert all categorical columns (Object Data Type) to the category data type. This will be accomplished by creating a function to identify non-numerical columns and converting them to the category data type.



def convert_cats(df):
    # Convert every object-dtype column to the category dtype
    for col in df.columns:
        if pd.api.types.is_object_dtype(df[col]):
            df[col] = df[col].astype('category')





convert_cats(fraud_v3)


We’ll write a function to review the dataset and ensure there are no columns of type object.






def check_no_object_dtype(df):
    assert not any(pd.api.types.is_object_dtype(df[col]) for col in df.columns), "Error: DataFrame contains object dtype columns."
    print("✅ No object dtype columns found in the DataFrame.")


check_no_object_dtype(fraud_v3)
## ✅ No object dtype columns found in the DataFrame.


Success. All columns of type object have been transformed to type category.


print(f"Shape of fraud_v2: {fraud_v2.shape}")
## Shape of fraud_v2: (28836, 45)
print(f"Shape of fraud_v3: {fraud_v3.shape}")
## Shape of fraud_v3: (28836, 42)


From the shape output we also confirm that the three date/time columns dropped earlier are gone, leaving fraud_v3 with 42 features.



fraud_v3["ReportedFraud"].value_counts(normalize=True).round(2)
## N    0.73
## Y    0.27
## Name: ReportedFraud, dtype: float64


About 27% of claims are reported as fraudulent, so the classes are imbalanced; this is worth keeping in mind when evaluating model metrics later.



gs=plt.GridSpec(1, 3)
fig=plt.figure(figsize=(10,8))
fig.suptitle('Categorical Counts-1', fontsize=8)


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])

#plt.title('Type of Incident',fontsize=7, y=1)
hg=sns.countplot(data = fraud_v3, x = 'TypeOfIncident', ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Type of Incident", fontsize=5) 
hg.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
sp=sns.countplot(data=fraud_v3, x='TypeOfCollission', ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=5)
sp.set_xlabel("Type of Collision", fontsize=5) 
sp.set_ylabel("Count",fontsize=4) 
#plt.title('Reported Fraud',fontsize=7, y=1)
bp=sns.countplot(data=fraud_v3, x='ReportedFraud', ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Reported Fraud", fontsize=5) 
bp.set_ylabel("Count", fontsize=5) 

plt.tight_layout()

plt.show()

plt.clf()

Next, we’ll check the data for missing values (NAs or nulls) and for values that are effectively unknown. Unknown values may be denoted by placeholder terms or symbols, such as a question mark.
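Since `isnull()` will not catch such placeholders, a quick scan for them can be sketched as follows; the toy frame and the token list (`'?'`, `'MISSINGVALUE'`, both of which appear later in this data set) are illustrative:

```python
import pandas as pd

# Toy frame with placeholder tokens standing in for unknown values
df = pd.DataFrame({
    "PoliceReport": ["YES", "?", "NO", "?"],
    "Witnesses": ["2", "MISSINGVALUE", "0", "1"],
})

# Count placeholder tokens per column
placeholders = ["?", "MISSINGVALUE"]
counts = df.isin(placeholders).sum()

print(counts)
```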


print(fraud_v3.isnull().sum())
## CustomerID                      0
## TypeOfIncident                  0
## TypeOfCollission                0
## SeverityOfIncident              0
## AuthoritiesContacted            0
## IncidentState                   0
## IncidentCity                    0
## IncidentAddress                 0
## NumberOfVehicles                0
## PropertyDamage                  0
## BodilyInjuries                  0
## Witnesses                       0
## PoliceReport                    0
## AmountOfInjuryClaim             0
## AmountOfPropertyClaim           0
## AmountOfVehicleDamage           0
## AmountOfTotalClaim              0
## InsuredAge                      0
## InsuredZipCode                  0
## InsuredGender                   0
## InsuredEducationLevel           0
## InsuredOccupation               0
## InsuredHobbies                  0
## CapitalGains                    0
## CapitalLoss                     0
## Country                         0
## InsurancePolicyNumber           0
## CustomerLoyaltyPeriod           0
## InsurancePolicyState            0
## Policy_CombinedSingleLimit      0
## Policy_Deductible               0
## PolicyAnnualPremium             0
## UmbrellaLimit                   0
## InsuredRelationship             0
## VehicleID                       0
## VehicleMake                     0
## VehicleModel                    0
## VehicleYOM                      0
## ReportedFraud                   0
## coverageIncidentDiff            0
## dayOfWeek                       0
## IncidentPeriodDay             422
## dtype: int64



with contextlib.redirect_stderr(sys.stdout):
  my_tab=pd.crosstab(index=fraud_v3["TypeOfIncident"], columns=fraud_v3["TypeOfCollission"], normalize=True).round(2)



fig = plt.figure(figsize=(13, 10))

sns.heatmap(my_tab, cmap="BuGn",cbar=False, annot=True,linewidth=0.3)

plt.yticks(rotation=0)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0, 0.5, 'Multi-vehicle Collision'), Text(0, 1.5, 'Parked Car'), Text(0, 2.5, 'Single Vehicle Collision'), Text(0, 3.5, 'Vehicle Theft')])
plt.xticks(rotation=60)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0.5, 0, '?'), Text(1.5, 0, 'Front Collision'), Text(2.5, 0, 'Rear Collision'), Text(3.5, 0, 'Side Collision')])
plt.title('Type of Incident vs Type of Collision', fontsize=20)
plt.xlabel('TypeOfCollision', fontsize=15)
plt.ylabel('TypeOfIncident', fontsize=15)

plt.show()

plt.clf()

We observe from the cross table that the unknown type of collision, denoted by '?', is associated with only a small subset of incident types. These data points will be retained by replacing the '?' level with 'None'.




fraud_v4 = fraud_v3.copy()

fraud_v4['TypeOfCollission'] = fraud_v4['TypeOfCollission'].replace(['?'], 'None')


plt.figure(figsize=(16,10))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=fraud_v4, x='TypeOfCollission')
#plt.tick_params(label_rotation=45)
ax.tick_params(axis='both', which='major', labelsize=11)
ax.set_title("Type of Collision-Changed", size=22)
ax.set(xlabel=None)
ax.set(ylabel=None)
sns.set_style("dark")

ax.annotate('Figure ##',

            xy = (1.0, -0.2),

            xycoords='axes fraction',

            ha='right',

            va="center",

            fontsize=10)
            
fig.tight_layout()

plt.show()

plt.clf()

fig = plt.figure(figsize=(6, 6))
fig.tight_layout(pad=1.30,h_pad=4, w_pad=3)
fig.suptitle('Categorical Review-Two', fontsize=11)
sns.set_style("dark")





plt.subplot(331)
plt.title('Witnesses', fontsize=8, y=0.90)
dt_1=sns.countplot(data = count_plts_2, x = 'Witnesses')
dt_1.tick_params(axis='both', which='major', labelsize=4)
dt_1.tick_params(axis='x',labelrotation=35)
dt_1.set(xlabel=None) 
dt_1.set(ylabel=None) 





plt.subplot(332)
plt.title('Bodily Injuries',fontsize=8, y=0.90)
dt_2=sns.countplot(data = count_plts_2, x = 'BodilyInjuries')
dt_2.tick_params(axis='both', which='major', labelsize=6)
#dt_2.tick_params(axis='x',labelrotation=35)
dt_2.set(xlabel=None) 
dt_2.set(ylabel=None) 


plt.subplot(333)
plt.title('Property Damage',fontsize=8, y=0.90)
dt_3=sns.countplot(data = count_plts_2, x = 'PropertyDamage')
dt_3.tick_params(axis='both', which='major', labelsize=6)
#dt_3.tick_params(axis='x',labelrotation=35)
dt_3.set(xlabel=None) 
dt_3.set(ylabel=None) 

plt.subplot(334)
plt.title('Number Of Vehicles',fontsize=6, y=0.80)
dt_4=sns.countplot(data = count_plts_2, x = 'NumberOfVehicles')
dt_4.tick_params(axis='both', which='major', labelsize=6)
#dt_4.tick_params(axis='x',labelrotation=45)
dt_4.set(xlabel=None) 
dt_4.set(ylabel=None) 

plt.subplot(335)
plt.title('Incident State',fontsize=8, y=0.90)
dt_5=sns.countplot(data = count_plts_2, x = 'IncidentState')
dt_5.tick_params(axis='both', which='major', labelsize=6)
dt_5.tick_params(axis='x',labelrotation=90)
dt_5.set(xlabel=None) 
dt_5.set(ylabel=None) 

plt.subplot(336)
plt.title('Authorities Contacted',fontsize=8, y=0.90)
dt_6=sns.countplot(data = count_plts_2, x = 'AuthoritiesContacted')
dt_6.tick_params(axis='both', which='major', labelsize=6)
dt_6.tick_params(axis='x',labelrotation=90)
dt_6.set(xlabel=None) 
dt_6.set(ylabel=None) 

plt.subplot(337)
plt.title('SeverityOfIncident',fontsize=8, y=0.90)
dt_7=sns.countplot(data = count_plts_2, x ='SeverityOfIncident')
dt_7.tick_params(axis='both', which='major', labelsize=6)
dt_7.tick_params(axis='x',labelrotation=90)
dt_7.set(xlabel=None) 
dt_7.set(ylabel=None) 


plt.subplots_adjust(wspace=1.0, hspace=2.0)

plt.show()

plt.clf()

From the figure Categorical Review-Two we detect certain features with missing values that must be addressed. First, the PropertyDamage feature will be dropped because many observations have no answer, denoted by a question mark.





fraud_v5 = fraud_v4.copy()

fraud_v5 = fraud_v5.drop(['PropertyDamage'], axis=1)



Next, the category MISSINGVALUE from the Witnesses feature will be dropped.







fraud_v5['Witnesses']=fraud_v5['Witnesses'].cat.remove_categories("MISSINGVALUE")
plt.figure(figsize=(14,8))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=fraud_v5, x='Witnesses')
#plt.tick_params(label_rotation=45)
ax.set_title("Witnesses-Changed", size=20)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='both', which='major', labelsize=14)
sns.set_style("dark")

ax.annotate('Figure ##', xy=(1.0, -0.2), xycoords='axes fraction',
            ha='right', va='center', fontsize=10)

plt.tight_layout()

plt.show()

plt.clf()

fig = plt.figure(figsize=(10, 6))
fig.tight_layout(pad=1.40,h_pad=4, w_pad=3)
fig.suptitle('Categorical Review-Three', fontsize=13)
sns.set_style("dark")





plt.subplot(231)
plt.title('Police Report', fontsize=8, y=0.90)
et_1=sns.countplot(data = count_plts_3, x = 'PoliceReport')
et_1.tick_params(axis='both', which='major', labelsize=6)
et_1.tick_params(axis='x',labelrotation=75)
et_1.set(xlabel=None) 
et_1.set(ylabel=None) 






plt.subplot(232)
plt.title('Insured Gender',fontsize=8, y=0.90)
et_2=sns.countplot(data = count_plts_3, x = 'InsuredGender')
et_2.tick_params(axis='both', which='major', labelsize=6)
et_2.tick_params(axis='x',labelrotation=75)
et_2.set(xlabel=None) 
et_2.set(ylabel=None) 



plt.subplot(233)
plt.title('Insurance Policy State',fontsize=8, y=0.90)
et_4=sns.countplot(data = count_plts_3, x = 'InsurancePolicyState')
et_4.tick_params(axis='both', which='major', labelsize=6)
et_4.tick_params(axis='x',labelrotation=70)
et_4.set(xlabel=None) 

plt.subplot(234)
plt.title('Insured Education Level',fontsize=7, y=0.90)
et_3=sns.countplot(data = count_plts_3, x = 'InsuredEducationLevel')
et_3.tick_params(axis='both', which='major', labelsize=6)
et_3.tick_params(axis='x',labelrotation=90)
et_3.set(xlabel=None) 
et_3.set(ylabel=None) 

plt.subplot(235)
plt.title('Insured Relationship',fontsize=8, y=0.90)
et_5=sns.countplot(data = count_plts_3, x = 'InsuredRelationship')
et_5.tick_params(axis='both', which='major', labelsize=6)
et_5.tick_params(axis='x',labelrotation=90)
et_5.set(xlabel=None) 
et_5.set(ylabel=None) 



plt.subplot(236)
plt.title('Day of Week',fontsize=8, y=0.90)
et_6=sns.countplot(data = count_plts_3, x = 'dayOfWeek')
et_6.tick_params(axis='both', which='major', labelsize=6)
et_6.tick_params(axis='x',labelrotation=90)
et_6.set(xlabel=None) 
et_6.set(ylabel=None) 


plt.subplots_adjust(wspace=1.0, hspace=1.4)

plt.show()

plt.clf()



fraud_v5['Witnesses']=fraud_v5['Witnesses'].cat.remove_unused_categories()


Categorical Review 3 informs us that additional categorical features must be either cleaned or dropped. First, the feature “Police Report” has close to 10,000 missing or unknown values (denoted by a question mark). This feature will be dropped.




fraud_v6=fraud_v6.drop(['PoliceReport'], axis=1)



The next feature requiring attention is InsuredGender. There is a small number of missing values, denoted by NA. This category will be removed from InsuredGender; omitting such a small-count category will have no meaningful effect on our models.





fraud_v6['InsuredGender']=fraud_v6['InsuredGender'].cat.remove_categories("NA")



fraud_v6['InsuredGender']=fraud_v6['InsuredGender'].cat.remove_unused_categories()


plt.figure(figsize=(14,10))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=fraud_v6, x='InsuredGender')
#plt.tick_params(label_rotation=45)
ax.set_title("Insured Gender-Changed", size=25)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='both',labelsize = 15)
sns.set_style("dark")

ax.annotate('Figure ##', xy=(1.0, -0.2), xycoords='axes fraction',
            ha='right', va='center', fontsize=10)

plt.tight_layout()

plt.show()

plt.clf()

## *******premium_missing shape*******
## (141, 40)



## *******fraud_v6 shape*******
## (28836, 40)



plt.figure(figsize=(16,6))

ax=sns.countplot(data=fraud_v6, x='VehicleMake')

ax.set_title("Vehicle Make", size=25)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='x',labelrotation=60,labelsize =13)
ax.tick_params(axis='y', labelsize=13)
sns.set_style("dark")

ax.annotate('Figure ##', xy=(1.0, -0.2), xycoords='axes fraction',
            ha='right', va='center', fontsize=10)

plt.tight_layout()

plt.show()

plt.clf()



VehicleMake has a small number of missing values (denoted by ‘???’). The category ‘???’ will be removed from the feature.




fraud_v7['VehicleMake']=fraud_v7['VehicleMake'].cat.remove_categories("???")

fraud_v7['VehicleMake']=fraud_v7['VehicleMake'].cat.remove_unused_categories()





veh_mk=vehicle_count.groupby('VehicleMake')['count'].agg('count').reset_index()


fig, axes=plt.subplots(figsize=(12, 8))

line_colors=['blue', 'cyan', 'green', 'red','skyblue','maroon', 'salmon', 'yellow', 
            'orange','lightgreen','darkviolet', 'fuchsia','darkmagenta','lime' ]
            
axes.hlines(veh_mk['VehicleMake'], xmin=0,
            xmax=veh_mk['count'],colors=line_colors)
            
axes.plot(veh_mk['count'],veh_mk['VehicleMake'],"o")
          
axes.set_xlim(0)
axes.tick_params(axis='both', which='major', labelsize=10)
plt.title('Make of Vehicle Count', fontsize=20)

plt.show()

plt.clf()

The VehicleMake feature now has no missing values.



Filtering for PolicyAnnualPremium values equal to -1 returns 141 observations. From the Attribute Information PDF provided with the data set, we know that -1 represents a missing value. All observations with -1 will be removed.





fraud_v7=fraud_v7[fraud_v7['PolicyAnnualPremium']!=-1]



print('**Policy Annual Premium Shape**')
## **Policy Annual Premium Shape**
fraud_v7[fraud_v7['PolicyAnnualPremium']==-1].shape
## (0, 40)



From the shape output we can observe all values of -1 have been removed.



Certain visualizations require numeric-only data. We’ll create a data set that contains only numeric data types.





#select only the numeric columns in the DataFrame

numeric_data=fraud_v7.select_dtypes(include=np.number)





numeric_data=numeric_data.drop(['InsuredZipCode', 'InsurancePolicyNumber'], axis=1)
## ******************Numeric Data Types******************
## AmountOfInjuryClaim        int32
## AmountOfPropertyClaim      int32
## AmountOfVehicleDamage      int32
## AmountOfTotalClaim         int32
## InsuredAge                 int32
## CapitalGains               int32
## CapitalLoss                int32
## CustomerLoyaltyPeriod      int32
## Policy_Deductible          int32
## PolicyAnnualPremium      float64
## UmbrellaLimit              int32
## coverageIncidentDiff     float64
## dtype: object



The data set numeric_data only includes features of numeric data types as seen from the above output.



Visualization

plt.figure(figsize=(10, 7))




plt.tick_params(axis='both', which='major', labelsize=9)



plt.title('Correlation Heatmap', fontsize=12)

# define a mask to hide the upper triangle of the heatmap

mask=np.triu(np.ones_like(numeric_data.corr(), dtype=bool))

# Generate a custom diverging colormap

#cmap = sns.diverging_palette(220, 10, as_cmap=True)

#ht_mp=sns.heatmap(fraud_train_v8.corr(), cmap=cmap, vmax=.3, center=0,annot=True,
            #square=True, linewidths=.5, cbar_kws={"shrink": .5})
            
            
heatmap = sns.heatmap(numeric_data.corr(), mask=mask,vmin=-1, vmax=1, annot=True, cmap='BrBG', annot_kws={"size": 4})

plt.show()

plt.clf()

            
            

There is very high to high correlation between Amount of Injury Claim, Amount of Property Claim, Amount of Vehicle Damage, and Amount of Total Claim. This is unsurprising as Amount of Total Claim is the sum of the other three. Amount of Total Claim is the only feature of the four that will be used for our machine learning models.

Other features exhibiting very high correlation are Customer Loyalty Period and Insured Age. This makes sense, as older customers have simply had more time to accrue loyalty. Still, we will retain both features for our models.
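As an illustration of how such highly correlated pairs can be surfaced programmatically, here is a small sketch on a toy DataFrame (the column names below are hypothetical stand-ins, not the data set's):

```python
import pandas as pd
import numpy as np

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 2))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

# toy data: 'total' is the sum of the other two, so it correlates strongly with both
toy = pd.DataFrame({'injury': [1, 2, 3, 4, 5],
                    'property': [2, 1, 4, 3, 6],
                    'total': [3, 3, 7, 7, 11]})
print(correlated_pairs(toy))
```

Applied to `numeric_data`, the same helper would flag the claim-amount cluster discussed above.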


Features not important for visualizing or building models will now be dropped.



fraud_v8=fraud_v8.drop(['CustomerID', 'IncidentAddress', 'InsuredZipCode', 'InsuredHobbies','Country', 'InsurancePolicyNumber', 'VehicleID'], axis=1)



## **fraud_v8 shape**
## (28695, 33)







fig = plt.figure(figsize=(11, 6))
fig.suptitle('Amount of Total Claim', fontsize=11)

sns.set_style("dark")






plt.subplot(131)
plt.title('Box Plot-Total Claim and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "AmountOfTotalClaim", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None) 



plt.subplot(132)
plt.title('Histogram-Amount of Total Claim', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="AmountOfTotalClaim")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None) 



plt.subplot(133)
plt.title('Histogram-Amount of Total Claim and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="AmountOfTotalClaim", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)

ac_3.set(xlabel=None) 



plt.subplots_adjust(wspace=0.45)

plt.show()

plt.clf()

fig = plt.figure(figsize=(11, 6))
fig.suptitle('Insured Age Review', fontsize=11)

sns.set_style("dark")


plt.subplot(131)
plt.title('Box Plot-Insured Age and Reported Fraud', fontsize=7)
ia_1=sns.boxplot(data = fraud_v8, x = "InsuredAge", y='ReportedFraud')
ia_1.tick_params(axis='x', which='major', labelsize=5)
ia_1.tick_params(axis='y', labelsize=5)
ia_1.tick_params(axis='x', labelrotation=60)
ia_1.set(xlabel=None) 



plt.subplot(132)
plt.title('Histogram- Insured Age', fontsize=7)
ia_2=sns.histplot(data=fraud_v8, x="InsuredAge")
ia_2.tick_params(axis='x', which='major', labelsize=5)
ia_2.tick_params(labelrotation=60)
ia_2.tick_params(axis='y', labelsize=5)
ia_2.set(xlabel=None) 



plt.subplot(133)
plt.title('Histogram-Insured Age and Reported Fraud', fontsize=7)
ia_3=sns.histplot(data=fraud_v8, x="InsuredAge",hue="ReportedFraud")
ia_3.tick_params(axis='x', which='major', labelsize=5)
ia_3.tick_params(axis='y', labelsize=5)

ia_3.set(xlabel=None) 



plt.subplots_adjust(wspace=0.45)

plt.show()

plt.clf()

fig = plt.figure(figsize=(11, 6))
fig.suptitle('Policy Annual Premium', fontsize=11)

sns.set_style("dark")






plt.subplot(131)
plt.title('Box Plot-AnnualPremium and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "PolicyAnnualPremium", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None) 



plt.subplot(132)
plt.title('Histogram-Amount of Annual Premium', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="PolicyAnnualPremium")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None) 



plt.subplot(133)
plt.title('Histogram-Annual Premium and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="PolicyAnnualPremium", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)

ac_3.set(xlabel=None) 



plt.subplots_adjust(wspace=0.45)

plt.show()

plt.clf()

fig = plt.figure(figsize=(11, 6))
fig.suptitle('Customer Loyalty Period', fontsize=11)

sns.set_style("dark")






plt.subplot(131)
plt.title('Box Plot-Customer Loyalty Period and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "CustomerLoyaltyPeriod", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None) 



plt.subplot(132)
plt.title('Histogram-Customer Loyalty Period', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="CustomerLoyaltyPeriod")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None) 



plt.subplot(133)
plt.title('Histogram-Customer Loyalty Period and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="CustomerLoyaltyPeriod", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)

ac_3.set(xlabel=None) 



plt.subplots_adjust(wspace=0.45)

plt.show()

plt.clf()

fig = plt.figure(figsize=(11, 6))
fig.suptitle('Difference Coverage Start and Incident', fontsize=11)

sns.set_style("dark")






plt.subplot(131)
plt.title('Box Plot-Coverage Start Incident Difference and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "coverageIncidentDiff", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None) 



plt.subplot(132)
plt.title('Histogram-Coverage Start Incident Difference', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="coverageIncidentDiff")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None) 



plt.subplot(133)
plt.title('Histogram-Coverage Start Incident Difference and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="coverageIncidentDiff", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)

ac_3.set(xlabel=None) 



plt.subplots_adjust(wspace=0.45)

plt.show()

plt.clf()

## **** Year Of Make****
## 2015     416
## 1995     531
## 1996     828
## 2014     871
## 1997    1131
## 2013    1256
## 1998    1276
## 2012    1308
## 2001    1428
## 1999    1479
## 2011    1518
## 2000    1523
## 2002    1527
## 2003    1571
## 2008    1622
## 2009    1623
## 2010    1631
## 2005    1635
## 2006    1637
## 2004    1661
## 2007    1709
## Name: VehicleYOM, dtype: int64


plt.figure(figsize=(16,6))

ax=sns.countplot(data=fraud_v8, x='VehicleYOM')

ax.set_title("Vehicle Year of Make", size=25)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='x',labelrotation=60,labelsize =13)
ax.tick_params(axis='y', labelsize=13)
sns.set_style("dark")

ax.annotate('Figure ##', xy=(1.0, -0.2), xycoords='axes fraction',
            ha='right', va='center', fontsize=10)

plt.tight_layout()

plt.show()

plt.clf()

Auto insurance premiums are generally based on personal details such as choice of coverage, type of vehicle driven, and age of the car. The newer the car, typically the more expensive the insurance, reflecting the vehicle’s replacement cost. The year the car was manufactured plays just as big a part in the premium as the make and model itself. The above plot displays all vehicle model years in our data set. We find just over 6,000 autos that are 15 years old or older relative to the latest year in the data, 2015.
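The age cutoff described above can be computed directly from the model-year column; a minimal sketch, using toy values in place of the real `VehicleYOM` series:

```python
import pandas as pd

# toy stand-in for fraud_v8's VehicleYOM column
vehicles = pd.DataFrame({'VehicleYOM': [1995, 1996, 1999, 2000, 2001, 2014, 2015]})

latest_year = vehicles['VehicleYOM'].max()   # 2015 in this toy data, as in the real set
age = latest_year - vehicles['VehicleYOM']   # vehicle age relative to the latest year
older_than_15 = (age >= 15).sum()            # count of vehicles aged 15 years or more
print(f"{older_than_15} of {len(vehicles)} vehicles are 15+ years old")
```

Run against the full data set, the same expression yields the 6,000-plus figure quoted above.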



fig = plt.figure(figsize=(11, 6))
fig.suptitle('Umbrella Limit Review', fontsize=12)
sns.set_style("dark")




plt.subplot(131)
plt.title('Box Plot-Umbrella Limit and Reported Fraud', fontsize=7)
ul_1=sns.boxplot(data = fraud_v8, x = "UmbrellaLimit", y='ReportedFraud')
ul_1.tick_params(axis='x', which='major', labelsize=5)
ul_1.tick_params(axis='y', labelsize=5)
ul_1.tick_params(axis='x', labelrotation=60)
ul_1.set(xlabel=None) 



plt.subplot(132)
plt.title('Histogram-UmbrellaLimit', fontsize=7)
ul_2=sns.histplot(data=fraud_v8, x="UmbrellaLimit",bins=20)
ul_2.tick_params(axis='x', which='major', labelsize=5)
ul_2.tick_params(axis='x',labelrotation=60)
ul_2.tick_params(axis='y', labelsize=5)
ul_2.set(xlabel=None) 



plt.subplot(133)
plt.title('Histogram-UmbrellaLimit and Reported Fraud', fontsize=7)
ul_3=sns.histplot(data=fraud_v8, x="UmbrellaLimit",hue='ReportedFraud',bins=20)
ul_3.tick_params(axis='x', which='major', labelsize=5)
ul_3.tick_params(axis='y', labelsize=5)
ul_3.set(xlabel=None) 



plt.subplots_adjust(wspace=0.45)

plt.show()

plt.clf()

The above plots are unusual. Both box plots have a median of zero. Reported Fraud = Y has a mean of 1,000,000 while Reported Fraud = N has a mean of 918,000. Both histograms peak at zero with a long tail to the right.



There are only 7,506 data points greater than zero: 2,417 for Yes and 5,089 for No. Data points greater than zero represent only 26 percent of the entire data set. Normally this would seem unusual, and we would review the raw data for errors. Checking the description of umbrella limit, however, we find that such extreme data points are not uncommon. Umbrella insurance provides “excess liability insurance” beyond the liability insurance already included in auto coverage. It covers expensive situations where medical bills and/or repairs exceed the limits of “base” auto policies. Policyholders in higher income brackets are usually the purchasers of umbrella coverage. Thus, for all data points, the mean of 972,000 and max of 10,000,000 are plausible values. Additionally, the median of zero is not surprising, as few insured add umbrella limits to their policies.
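The nonzero share and its Yes/No split can be checked with a few lines; this sketch uses a toy stand-in for the real `UmbrellaLimit` and `ReportedFraud` columns:

```python
import pandas as pd

# toy stand-in for fraud_v8[['UmbrellaLimit', 'ReportedFraud']]
df = pd.DataFrame({'UmbrellaLimit': [0, 0, 0, 2_000_000, 5_000_000, 0, 1_000_000, 0],
                   'ReportedFraud': ['N', 'Y', 'N', 'Y', 'N', 'N', 'Y', 'N']})

nonzero = df[df['UmbrellaLimit'] > 0]
share = len(nonzero) / len(df)                      # fraction of policies with umbrella coverage
by_fraud = nonzero['ReportedFraud'].value_counts()  # Yes/No split among nonzero limits
print(round(share, 3), by_fraud.to_dict())
```

On the full data set this computation produces the 26 percent figure and the 2,417 / 5,089 split quoted above.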



sns.set(style="darkgrid")

plt.figure(figsize=(10, 7))

# top bars: total count (ReportedFraud = N plus ReportedFraud = Y) per category
total = fraud_v8.groupby('SeverityOfIncident')['count'].sum().reset_index()



# bar chart 1: top bars (all claims)
bar1 = sns.barplot(x="SeverityOfIncident",  y="count", data=total, color='darkblue')

# bottom bars: only ReportedFraud = Y values
fraud = fraud_v8[fraud_v8.ReportedFraud=='Y']

# bar chart 2: bottom bars (fraudulent claims)
bar2 = sns.barplot(x="SeverityOfIncident", y="count", data=fraud, estimator=sum, errorbar=None,  color='lightblue')

# add legend
top_bar = mpatches.Patch(color='darkblue', label='Fraud = No')
bottom_bar = mpatches.Patch(color='lightblue', label='Fraud = Yes')
plt.legend(handles=[top_bar, bottom_bar],fontsize=9, loc="upper right")



plt.tick_params(axis='x', which='major', labelsize=8, labelrotation=75)

plt.tick_params(axis='y', which='major', labelsize=8)



plt.title(" Reported Fraud and Severity Of Incident", fontsize=15)
plt.xlabel(None)
plt.ylabel(None)

plt.show()

plt.clf()

The above plot displays bar plots of categories belonging to the feature ‘Severity of Incident’, stacked based on whether fraud is ‘Y’ or ‘N’. ‘Major Damage’ stands out: 60% of its claims are reported as fraud, whereas the other categories are under 16%.
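The per-category fraud rates behind this kind of stacked plot can be computed with a row-normalized crosstab; a sketch on toy data standing in for the real columns:

```python
import pandas as pd

# toy stand-in for fraud_v8[['SeverityOfIncident', 'ReportedFraud']]
df = pd.DataFrame({
    'SeverityOfIncident': ['Major Damage', 'Major Damage', 'Major Damage',
                           'Minor Damage', 'Minor Damage', 'Trivial Damage'],
    'ReportedFraud':      ['Y', 'Y', 'N', 'N', 'Y', 'N'],
})

# row-normalized crosstab: each row sums to 1, giving the fraud share per category
rates = pd.crosstab(df['SeverityOfIncident'], df['ReportedFraud'], normalize='index')
print(rates.round(2))
```

The `Y` column of this table is exactly the per-category fraud percentage discussed above.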





with contextlib.redirect_stderr(sys.stdout):
  grouped_veh_mk=fraud_v8.groupby(['VehicleMake','ReportedFraud']).agg({'count':'sum'})



grouped_veh_mk_perc=grouped_veh_mk.groupby(level=0, group_keys=False).apply(lambda x: x /x.sum()).round(2)


grouped_veh_mk_perc.rename(columns={'count':'Percent'}, inplace=True)

#Convert from multi index to single index.

grouped_veh_mk_perc_single=grouped_veh_mk_perc.reset_index(level=[1])

grouped_veh_mk_perc_single=grouped_veh_mk_perc_single.reset_index()

#Pivot wider. This makes "Y" and "N" separate columns
grouped_veh_mk_perc_wide=grouped_veh_mk_perc_single.pivot(index='VehicleMake',columns='ReportedFraud',values='Percent').reset_index()



#Reorder df following 'N'

grouped_veh_mk_ordered=grouped_veh_mk_perc_wide.sort_values(by='N')





my_range=range(1,len(grouped_veh_mk_ordered.index)+1)


plt.figure(figsize=(9, 9))

plt.hlines(y=my_range, xmin=grouped_veh_mk_ordered['N'], xmax=grouped_veh_mk_ordered['Y'], color='grey', alpha=0.4)
plt.scatter(grouped_veh_mk_ordered['N'], my_range, color='skyblue', alpha=1, label='N')
plt.scatter(grouped_veh_mk_ordered['Y'], my_range, color='green', alpha=0.4 , label='Y')
plt.legend(title="Reported Fraud", loc="lower right", title_fontsize=18,fontsize=6, borderpad=0, facecolor="wheat")

plt.yticks(my_range, grouped_veh_mk_ordered['VehicleMake'])
plt.title("Reported Fraud by Vehicle Make", fontsize=15,loc='center')
plt.xlabel('Percent', fontsize=6)
plt.ylabel(None)


plt.tick_params(axis='x', which='major', labelsize=5, labelrotation=90)

plt.tick_params(axis='y', which='major', labelsize=5)

plt.show()

plt.clf()

We find that Volkswagen, Mercedes, Ford, BMW, and Audi are the vehicle makes with reported fraud over 30%. This is an interesting statistic; however, due to the large number of categories, we’ll explore the VehicleMake feature further.



Box plots show the median total claims is roughly the same for all models.



Nissan, Subaru, and Toyota have a median capital gain near 20,000, substantially larger than all other makes. The makes with over 30% reported fraud all have a median capital gain of zero.

Due to the number of categories in “Vehicle Make,” we will exclude it from the modeling process.
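A quick cardinality check, sketched below on toy data, shows why high-category features are costly: one-hot encoding adds a column per category, inflating the model's feature space.

```python
import pandas as pd

# toy stand-in for a high-cardinality categorical column like VehicleMake
makes = pd.Series(['Ford', 'BMW', 'Audi', 'Ford', 'Honda', 'Toyota', 'Saab'],
                  dtype='category')

n_categories = makes.nunique()          # distinct categories ('Ford' counted once)
encoded = pd.get_dummies(makes)         # one indicator column per category
print(n_categories, encoded.shape[1])
```

With 14 makes in the real data, encoding VehicleMake would add 14 sparse columns, which motivates dropping it here.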





grouped_inc_st=fraud_v8.groupby(['IncidentState','ReportedFraud']).agg({'count':'sum'})




grouped_inc_st_perc=grouped_inc_st.groupby(level=0, group_keys=False).apply(lambda x: x /x.sum()).round(2)



#numeric_only


grouped_inc_st_perc.rename(columns={'count':'Percent'}, inplace=True)


grouped_inc_st_perc_single=grouped_inc_st_perc.reset_index(level=[1])


grouped_inc_st_perc_single=grouped_inc_st_perc_single.reset_index()

grouped_inc_st_perc_wide=grouped_inc_st_perc_single.pivot(index='IncidentState',columns='ReportedFraud',values='Percent').reset_index()



grouped_inc_st_ordered=grouped_inc_st_perc_wide.sort_values(by='N')

my_range_2=range(1,len(grouped_inc_st_ordered.index)+1)
plt.figure(figsize=(9, 9))

plt.hlines(y=my_range_2, xmin=grouped_inc_st_ordered['N'], xmax=grouped_inc_st_ordered['Y'], color='grey', alpha=0.4)
plt.scatter(grouped_inc_st_ordered['N'], my_range_2, color='skyblue', alpha=1, label='N')
plt.scatter(grouped_inc_st_ordered['Y'], my_range_2, color='green', alpha=0.4 , label='Y')
plt.legend(title="Reported Fraud", loc="lower right", title_fontsize=8,fontsize=6, borderpad=0, facecolor="wheat")

plt.yticks(my_range_2, grouped_inc_st_ordered['IncidentState'])
plt.title("Reported Fraud by Incident State", loc='center')
plt.xlabel('Percent', fontsize=6)
plt.ylabel('Incident State')


plt.tick_params(axis='x', which='major', labelsize=5, labelrotation=90)

plt.tick_params(axis='y', which='major', labelsize=5)

plt.show()

plt.clf()

Incident states 4, 6, and 7 have reported fraud just over 30%, which appears significant. However, incident state 3 stands out from the other states with reported fraud of around 42%.





grouped_type_inc=fraud_v8.groupby(['TypeOfIncident','ReportedFraud']).agg({'count':'sum'})


grouped_type_inc_perc=grouped_type_inc.groupby(level=0, group_keys=False).apply(lambda x: x /x.sum()).round(2)


grouped_type_inc_perc.rename(columns={'count':'Percent'}, inplace=True)


grouped_type_inc_perc_single=grouped_type_inc_perc.reset_index(level=[1])




grouped_type_inc_perc_single=grouped_type_inc_perc_single.reset_index()



grouped_type_inc_perc_wide=grouped_type_inc_perc_single.pivot(index='TypeOfIncident',columns='ReportedFraud',values='Percent').reset_index()


grouped_type_inc_ordered=grouped_type_inc_perc_wide.sort_values(by='N')



my_range_3=range(1,len(grouped_type_inc_ordered.index)+1)
plt.figure(figsize=(9, 9))

plt.hlines(y=my_range_3, xmin=grouped_type_inc_ordered['N'], xmax=grouped_type_inc_ordered['Y'], color='grey', alpha=0.4)
plt.scatter(grouped_type_inc_ordered['N'], my_range_3, color='skyblue', alpha=1, label='N')
plt.scatter(grouped_type_inc_ordered['Y'], my_range_3, color='green', alpha=0.4 , label='Y')
plt.legend(title="Reported Fraud", loc="lower right", title_fontsize=8,fontsize=6, borderpad=0, facecolor="wheat")

plt.yticks(my_range_3, grouped_type_inc_ordered['TypeOfIncident'])
plt.title("Reported Fraud by Type of Incident", loc='center')
plt.xlabel('Percent', fontsize=6)
plt.ylabel('Incident')

plt.tick_params(axis='x', which='major', labelsize=5, labelrotation=90)

plt.tick_params(axis='y', which='major', labelsize=5)

plt.show()

plt.clf()

From the above plots we observe that two categories stand out with respect to reported fraud: ‘Single Vehicle Collision’ and ‘Multi-vehicle Collision’ from the feature ‘Type of Incident’ have claims reported as fraud at 31% and 29%, respectively. The other two categories are under 14%.



Outliers


We’ll now review our data for outliers. Our goal is not necessarily to remove observations flagged as outliers; rather, it is to derive insights that may help us understand reported fraud in combination with our machine learning models.


Numeric Review


fig = plt.figure(figsize=(6, 6))
fig.tight_layout(pad=1.30,h_pad=4, w_pad=3)
fig.suptitle('Numerical Feature Distributions', fontsize=11)
sns.set_style("dark")





plt.subplot(431)
plt.title('Policy Annual Premiums',fontsize=8, y=0.90)
dt_1=sns.histplot(data = numeric_data, x ='PolicyAnnualPremium')
dt_1.tick_params(axis='both', which='major', labelsize=4)
#dt_1.tick_params(axis='x',labelrotation=35)
dt_1.set(xlabel=None) 
dt_1.set(ylabel=None) 





plt.subplot(432)
plt.title('Umbrella Limit',fontsize=8, y=0.90)
dt_2=sns.histplot(data = numeric_data, x ='UmbrellaLimit')
dt_2.tick_params(axis='both', which='major', labelsize=6)
#dt_2.tick_params(axis='x',labelrotation=35)
dt_2.set(xlabel=None) 
dt_2.set(ylabel=None) 


plt.subplot(433)
plt.title('coverage Incident Difference',fontsize=8, y=0.90)
dt_3=sns.histplot(data = numeric_data, x ='coverageIncidentDiff')
dt_3.tick_params(axis='both', which='major', labelsize=6)
#dt_3.tick_params(axis='x',labelrotation=35)
dt_3.set(xlabel=None) 
dt_3.set(ylabel=None) 

plt.subplot(434)
plt.title('Amount Of Total Claim',fontsize=6, y=0.80)
dt_4=sns.histplot(data = numeric_data, x = 'AmountOfTotalClaim')
dt_4.tick_params(axis='both', which='major', labelsize=6)
#dt_4.tick_params(axis='x',labelrotation=45)
dt_4.set(xlabel=None) 
dt_4.set(ylabel=None) 

plt.subplot(435)
plt.title('Insured Age',fontsize=8, y=0.90)
dt_5=sns.histplot(data = numeric_data, x = 'InsuredAge')
dt_5.tick_params(axis='both', which='major', labelsize=6)
#dt_5.tick_params(axis='x',labelrotation=90)
dt_5.set(xlabel=None) 
dt_5.set(ylabel=None) 

plt.subplot(436)
plt.title('Capital Gains',fontsize=8, y=0.90)
dt_6=sns.histplot(data = numeric_data, x = 'CapitalGains')
dt_6.tick_params(axis='both', which='major', labelsize=6)
#dt_6.tick_params(axis='x',labelrotation=90)
dt_6.set(xlabel=None) 
dt_6.set(ylabel=None) 

plt.subplot(437)
plt.title('Capital Loss',fontsize=8, y=0.90)
dt_7=sns.histplot(data = numeric_data, x ='CapitalLoss')
dt_7.tick_params(axis='both', which='major', labelsize=6)
#dt_7.tick_params(axis='x',labelrotation=90)
dt_7.set(xlabel=None) 
dt_7.set(ylabel=None) 

plt.subplot(438)
plt.title('Customer Loyalty Period',fontsize=8, y=0.90)
dt_8=sns.histplot(data = numeric_data, x ='CustomerLoyaltyPeriod')
dt_8.tick_params(axis='both', which='major', labelsize=6)
#dt_7.tick_params(axis='x',labelrotation=90)
dt_8.set(xlabel=None) 
dt_8.set(ylabel=None) 

plt.subplot(439)
plt.title('Policy Deductible',fontsize=8, y=0.90)
dt_9=sns.histplot(data = numeric_data, x ='Policy_Deductible')
dt_9.tick_params(axis='both', which='major', labelsize=6)
#dt_7.tick_params(axis='x',labelrotation=90)
dt_9.set(xlabel=None) 
dt_9.set(ylabel=None) 





plt.subplots_adjust(wspace=1.0, hspace=2.0)

plt.show()

plt.clf()

The above plots indicate that certain numerical features exhibit distributions that are not normal, so we will zoom in on these features.



n_bins=np.sqrt(len(numeric_data))

# cast the square-root bin count to an integer
n_bins=int(n_bins)

integers_um=range(len(numeric_data["UmbrellaLimit"]))


gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))
fig.suptitle('Umbrella Limit', fontsize=8)


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[1, 0])
ax3=fig.add_subplot(gs[:, 1])

#plt.title('Histogram',fontsize=7, y=1)
hg=sns.histplot(data = numeric_data, x = 'UmbrellaLimit', bins=n_bins,ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Umbrella Limit", fontsize=5) 
hg.set_ylabel("Count",fontsize=5)
#plt.title('Scatter Plot',fontsize=7, y=1)
sp=sns.scatterplot(data=numeric_data, x=integers_um, y='UmbrellaLimit', ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=3.5)
sp.set_xlabel("Index", fontsize=5) 
sp.set_ylabel("Umbrella Limit",fontsize=4) 
plt.title('Boxplot',fontsize=7, y=1)
bp=sns.boxplot(data=numeric_data, y='UmbrellaLimit', ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Umbrella Limit", fontsize=5) 
bp.set_ylabel(None) 

plt.tight_layout()

plt.show()

plt.clf()

The right tail shows many bars with a height of nearly zero, far off from the bulk of the histogram, suggesting possible outliers.

In the scatter plot, we see many suspicious points around 0.9 on the 1e7 axis scale. The boxplot has points above 0.1 on the same scale that may be outliers.


integers_tc=range(len(numeric_data["AmountOfTotalClaim"]))


gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[1, 0])
ax3=fig.add_subplot(gs[:, 1])

hg=sns.histplot(data = numeric_data, x = 'AmountOfTotalClaim', ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Amount of Total Claim", fontsize=5) 
hg.set_ylabel("Count",fontsize=5)
sp=sns.scatterplot(data=numeric_data, x=integers_tc, y="AmountOfTotalClaim", ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=3.5)
sp.set_xlabel("Index", fontsize=5) 
sp.set_ylabel("Total Claim",fontsize=4) 
bp=sns.boxplot(data=numeric_data, y="AmountOfTotalClaim", ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Amount of Total Claim", fontsize=5) 
bp.set_ylabel(None) 

plt.tight_layout()
plt.show()

plt.clf()

## ******Total Claim Description******
## count     28695.000000
## mean      52303.964733
## std       25109.177907
## min         150.000000
## 25%       44612.500000
## 50%       58362.000000
## 75%       68975.500000
## max      114920.000000
## Name: AmountOfTotalClaim, dtype: float64


print("Ten smallest Total Claim Amounts\n",numeric_data['AmountOfTotalClaim'].nsmallest(10))
## Ten smallest Total Claim Amounts
##  17433    150
## 17427    313
## 23140    334
## 22996    489
## 4654     547
## 9308     598
## 17430    681
## 18674    725
## 23136    812
## 12639    838
## Name: AmountOfTotalClaim, dtype: int32


print("Ten Largest Total Claim Amounts\n",numeric_data['AmountOfTotalClaim'].nlargest(10))
## Ten Largest Total Claim Amounts
##  97       114920
## 2421     114141
## 14909    114113
## 18186    113997
## 27450    113771
## 6396     112817
## 2535     112560
## 7357     111870
## 25936    111771
## 17390    111708
## Name: AmountOfTotalClaim, dtype: int32


The histogram has two peaks, one near zero and a second near 60,000, which may not reflect outliers. The scatter plot and boxplot both have points around 110,000 and near zero that could be outliers. From the descriptive statistics, the minimum value of 150 is substantially lower than the 25th percentile of 44,612; likewise, the maximum of 114,920 is substantially higher than the 75th percentile of 68,975. The ten lowest and ten highest values support this, so it's possible these values are outliers.
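The quantile comparison above can be made systematic with the 1.5×IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged. A minimal sketch on synthetic claim-like values (not the actual dataset):

```python
import pandas as pd

# Synthetic claim amounts, chosen for illustration only
s = pd.Series([150, 313, 44_612, 58_362, 68_975, 114_920, 60_000, 50_000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside the fences
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Note that with a heavily skewed sample the lower fence can fall below zero, so very small claims may escape the rule even when they look suspicious on a plot.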






integers_cg=range(len(numeric_data["CapitalGains"]))


gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[1, 0])
ax3=fig.add_subplot(gs[:, 1])

hg=sns.histplot(data = numeric_data, x = "CapitalGains", ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Capital Gains", fontsize=5) 
hg.set_ylabel("Count",fontsize=5)
sp=sns.scatterplot(data=numeric_data, x=integers_cg, y="CapitalGains", ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=3.5)
sp.set_xlabel("Index", fontsize=5) 
sp.set_ylabel("Capital Gains",fontsize=4) 
bp=sns.boxplot(data=numeric_data, y="CapitalGains", ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Capital Gains", fontsize=5) 
bp.set_ylabel(None) 

plt.tight_layout()


plt.show()

plt.clf()

## Capital Gains Description
##  count     28695.000000
## mean      23074.225475
## std       27638.373450
## min           0.000000
## 25%           0.000000
## 50%           0.000000
## 75%       49000.000000
## max      100500.000000
## Name: CapitalGains, dtype: float64


## Ten smallest Capital Gain Amounts
##  4     0
## 5     0
## 11    0
## 12    0
## 14    0
## 15    0
## 16    0
## 17    0
## 18    0
## 23    0
## Name: CapitalGains, dtype: int32


## Ten Largest Capital Gain Amounts
##  593      100500
## 2064     100500
## 3000     100500
## 3274     100500
## 4642     100500
## 5423     100500
## 6093     100500
## 9284     100500
## 9420     100500
## 12750    100500
## Name: CapitalGains, dtype: int32


The scatter plot has points around 100,000 that may be outliers, though neither the histogram nor the boxplot indicates this. The maximum value of 100,500 is substantially higher than the 75th percentile of 49,000. However, 100,500 appears in all ten largest values, so these are likely not outliers.




integers_cl=range(len(numeric_data["CapitalLoss"]))


gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[1, :])

sns.histplot(data = numeric_data, x = "CapitalLoss", ax=ax1)
sns.scatterplot(data=numeric_data, x=integers_cl, y="CapitalLoss", ax=ax2)
sns.boxplot(data=numeric_data, x="CapitalLoss", ax=ax3)

plt.tight_layout()

plt.show()

plt.clf()
            

print("Capital Loss Description\n", numeric_data["CapitalLoss"].describe())
## Capital Loss Description
##  count     28695.000000
## mean     -24942.289597
## std       27919.212327
## min     -111100.000000
## 25%      -50000.000000
## 50%           0.000000
## 75%           0.000000
## max           0.000000
## Name: CapitalLoss, dtype: float64


print("Ten smallest Capital Loss Amounts\n",numeric_data["CapitalLoss"].nsmallest(10))
## Ten smallest Capital Loss Amounts
##  583     -111100
## 584     -111100
## 1341    -111100
## 4691    -111100
## 5417    -111100
## 6658    -111100
## 11041   -111100
## 12724   -111100
## 12725   -111100
## 12726   -111100
## Name: CapitalLoss, dtype: int32


print("Ten Largest Capital Loss Amounts\n",numeric_data["CapitalLoss"].nlargest(10))
## Ten Largest Capital Loss Amounts
##  6     0
## 7     0
## 9     0
## 11    0
## 12    0
## 13    0
## 14    0
## 18    0
## 22    0
## 23    0
## Name: CapitalLoss, dtype: int32


The scatter plot has points near -100,000 that may be outliers, and the histogram shows the same pattern in its left tail. There is a large gap between the minimum value of -111,100 and the 25th percentile of -50,000. Checking the ten lowest values, we see that -111,100 occupies all ten, so these are likely not outliers.






integers_pd=range(len(numeric_data["Policy_Deductible"]))


gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[1, :])

sns.histplot(data = numeric_data, x = "Policy_Deductible", ax=ax1)
sns.scatterplot(data=numeric_data, x=integers_pd, y="Policy_Deductible", ax=ax2)
sns.boxplot(data=numeric_data, x="Policy_Deductible", ax=ax3)

plt.tight_layout()

plt.show()

plt.clf()

print("Policy Deductible description\n",numeric_data["Policy_Deductible"].describe())
## Policy Deductible description
##  count    28695.000000
## mean      1114.250671
## std        546.567184
## min        500.000000
## 25%        622.000000
## 50%       1000.000000
## 75%       1625.500000
## max       2000.000000
## Name: Policy_Deductible, dtype: float64


## Ten smallest Policy Deductible Amounts
##  4     500
## 5     500
## 10    500
## 14    500
## 15    500
## 21    500
## 22    500
## 48    500
## 49    500
## 73    500
## Name: Policy_Deductible, dtype: int32


## Ten largest Policy Deductible Amounts
##  8     2000
## 18    2000
## 27    2000
## 28    2000
## 29    2000
## 33    2000
## 36    2000
## 37    2000
## 39    2000
## 40    2000
## Name: Policy_Deductible, dtype: int32


The deductible values range from 500 to 2,000, and both extremes occur repeatedly among the ten smallest and ten largest values. These appear to be standard deductible tiers rather than outliers.



Categorical Review



We’ll check our categorical features first by viewing their distributions. We will then use boxplots to determine whether any category of a categorical feature differs from the other categories across all of our chosen numeric features. Categories that exhibit differences from the other categories across all numeric features may be outliers.




fraud_v9=fraud_v8.copy()





fraud_v9=fraud_v9.drop(['IncidentCity','AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleModel',
'VehicleYOM', 'count', 'InsuredEducationLevel','InsuredOccupation', 'VehicleMake'], axis=1)



gs=plt.GridSpec(5, 3)
fig=plt.figure(figsize=(7,5))
fig.suptitle('Categorical Feature Distributions', fontsize=8)


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
ax4=fig.add_subplot(gs[1,0])
ax5=fig.add_subplot(gs[1,1])
ax6=fig.add_subplot(gs[1,2])
ax7=fig.add_subplot(gs[2, 0])
ax8=fig.add_subplot(gs[2, 1])
ax9=fig.add_subplot(gs[2,2])
ax10=fig.add_subplot(gs[3,0])
ax11=fig.add_subplot(gs[3,1])
ax12=fig.add_subplot(gs[3,2])
ax13=fig.add_subplot(gs[4,0])


#plt.title('Type of Incident',fontsize=7, y=1)
ct1=sns.countplot(data = fraud_v9, y='TypeOfIncident', orient='h', ax=ax1)
ct1.tick_params(axis='both', which='major', labelsize=4)
ct1.set_xlabel(None) 
ct1.set_ylabel("Type Of Incident", fontsize=5) 
#plt.title('Type of Collision',fontsize=7, y=1)
ct2=sns.countplot(data = fraud_v9, y='TypeOfCollission', orient='h', ax=ax2)
ct2.tick_params(axis='both', which='major', labelsize=5)
ct2.set_ylabel("Type Of Collision", fontsize=5) 
ct2.set_xlabel(None) 
#plt.title('Reported Fraud',fontsize=7, y=1)
ct3=sns.countplot(data=fraud_v9, y='SeverityOfIncident', orient='h', ax=ax3)
ct3.tick_params(axis='both', which='major', labelsize=5)
ct3.set_ylabel("Severity Of Incident", fontsize=5) 
ct3.set_xlabel(None) 
ct4=sns.countplot(data=fraud_v9, y='AuthoritiesContacted', orient='h', ax=ax4)
ct4.tick_params(axis='both', which='major', labelsize=5)
ct4.set_ylabel("Authorities Contacted", fontsize=5) 
ct4.set_xlabel(None) 
ct5=sns.countplot(data=fraud_v9, y='IncidentState', orient='h', ax=ax5)
ct5.tick_params(axis='both', which='major', labelsize=5)
ct5.set_ylabel('Incident State',fontsize=5) 
ct5.set_xlabel(None) 
ct6=sns.countplot(data=fraud_v9,y='NumberOfVehicles', orient='h', ax=ax6)
ct6.tick_params(axis='both', which='major', labelsize=5)
ct6.set_ylabel("Number Of Vehicles ", fontsize=5) 
ct6.set_xlabel(None) 
ct7=sns.countplot(data = fraud_v9, y='BodilyInjuries', orient='h', ax=ax7)
ct7.tick_params(axis='both', which='major', labelsize=4)
ct7.set_xlabel(None) 
ct7.set_ylabel("Bodily Injuries", fontsize=5) 
#plt.title('Type of Collision',fontsize=7, y=1)
ct8=sns.countplot(data = fraud_v9, y='Witnesses', orient='h', ax=ax8)
ct8.tick_params(axis='both', which='major', labelsize=5)
ct8.set_ylabel("Witnesses", fontsize=5) 
ct8.set_xlabel(None) 
#plt.title('Reported Fraud',fontsize=7, y=1)
ct9=sns.countplot(data=fraud_v9, y='InsurancePolicyState', orient='h', ax=ax9)
ct9.tick_params(axis='both', which='major', labelsize=5)
ct9.set_ylabel("Insurance Policy State", fontsize=5) 
ct9.set_xlabel(None) 
ct10=sns.countplot(data=fraud_v9, y='Policy_CombinedSingleLimit', orient='h', ax=ax10)
ct10.tick_params(axis='both', which='major', labelsize=5)
ct10.set_ylabel("Policy Combined/Single Limit", fontsize=4) 
ct10.set_xlabel(None) 
ct11=sns.countplot(data=fraud_v9, y='InsuredRelationship', orient='h', ax=ax11)
ct11.tick_params(axis='both', which='major', labelsize=5)
ct11.set_ylabel('Insured Relationship ',fontsize=5) 
ct11.set_xlabel(None) 
ct12=sns.countplot(data=fraud_v9,y='dayOfWeek', orient='h', ax=ax12)
ct12.tick_params(axis='both', which='major', labelsize=5)
ct12.set_ylabel("Day Of Week", fontsize=5) 
ct12.set_xlabel(None) 
ct13=sns.countplot(data=fraud_v9,y='IncidentPeriodDay', orient='h', ax=ax13)
ct13.tick_params(axis='both', which='major', labelsize=5)
ct13.set_ylabel("Incident Period Day", fontsize=5) 
ct13.set_xlabel(None) 
plt.tight_layout()

plt.show()

plt.clf()

gs=plt.GridSpec(2, 3)
fig=plt.figure(figsize=(8,6))
fig.suptitle('Type Of Incident', fontsize=8)


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
ax4=fig.add_subplot(gs[1,0])
ax5=fig.add_subplot(gs[1,1])
ax6=fig.add_subplot(gs[1,2])

#plt.title('Type of Incident',fontsize=7, y=1)
bx1=sns.boxplot(data=fraud_v9, x='AmountOfTotalClaim', y='TypeOfIncident', orient='h', ax=ax1)
bx1.tick_params(axis='both', which='major', labelsize=4)
bx1.set_xlabel("Total Claim", fontsize=5) 
bx1.set_ylabel(None) 
#plt.title('Type of Collision',fontsize=7, y=1)
bx2=sns.boxplot(data=fraud_v9, x='InsuredAge', y='TypeOfIncident', orient='h', ax=ax2)
bx2.tick_params(axis='both', which='major', labelsize=5)
bx2.set_xlabel("Age", fontsize=5) 
bx2.set_ylabel(None) 
#plt.title('Reported Fraud',fontsize=7, y=1)
bx3=sns.boxplot(data=fraud_v9, x='CustomerLoyaltyPeriod', y='TypeOfIncident', orient='h', ax=ax3)
bx3.tick_params(axis='both', which='major', labelsize=5)
bx3.set_xlabel("Loyalty Period", fontsize=5) 
bx3.set_ylabel(None) 
bx4=sns.boxplot(data=fraud_v9, x='Policy_Deductible', y='TypeOfIncident', orient='h', ax=ax4)
bx4.tick_params(axis='both', which='major', labelsize=5)
bx4.set_xlabel("Deductable", fontsize=5) 
bx4.set_ylabel(None) 
bx5=sns.boxplot(data=fraud_v9, x='PolicyAnnualPremium', y='TypeOfIncident', orient='h', ax=ax5)
bx5.tick_params(axis='both', which='major', labelsize=5)
bx5.set_ylabel(None) 
bx6=sns.boxplot(data=fraud_v9, x='UmbrellaLimit', y='TypeOfIncident', orient='h', ax=ax6)
bx6.tick_params(axis='both', which='major', labelsize=5)
bx6.set_xlabel("Umbrella Limit", fontsize=5) 
bx6.set_ylabel(None)  
plt.tight_layout()

plt.show()

plt.clf()

In the Type of Incident boxplots we find that the Parked Car and Vehicle Theft categories differ from the others for Total Claim, though this pattern is not consistent across our other numeric features, so we cannot draw any conclusions.

The Trivial Damage category of Severity of Incident differs from the other categories for Total Claim, though again this is not consistent across the other numeric features.

No categorical features had categories that were different from the other categories across the numerical features.



Outlier Detection Model



Convert target variable to binary (Y -> 1, N -> 0)





out_mod1['ReportedFraud'] = out_mod1['ReportedFraud'].map({'Y': 1, 'N': 0})





out_mod2=out_mod1.copy()




out_mod2=out_mod2.drop('ReportedFraud', axis=1)


Identify categorical and numerical columns




categorical_cols = out_mod2.select_dtypes(include=['category']).columns.tolist()
numerical_cols = out_mod2.select_dtypes(include=['int32', 'float64']).columns.tolist()



One-Hot Encoding for categorical variables


encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_cats = encoder.fit_transform(out_mod2[categorical_cols])
encoded_df = pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(categorical_cols))


Standardize numerical features



scaler = StandardScaler()
scaled_nums = scaler.fit_transform(out_mod2[numerical_cols])
scaled_df = pd.DataFrame(scaled_nums, columns=numerical_cols)


Combine processed numerical and categorical data





processed_df = pd.concat([scaled_df, encoded_df,out_mod1['ReportedFraud']], axis=1)





iso_forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)





feature_columns = [col for col in processed_df.columns if col != 'ReportedFraud']


with contextlib.redirect_stderr(sys.stdout):
  iso_forest.fit(processed_df[feature_columns])
IsolationForest(contamination=0.05, random_state=42)



# Score a copy of the processed frame so the original stays untouched
processed_df_2 = processed_df.copy()

with contextlib.redirect_stderr(sys.stdout):
  # Get anomaly scores and predictions
  processed_df_2['Anomaly_Score'] = iso_forest.decision_function(processed_df_2[feature_columns])
  processed_df_2['Anomaly_Label'] = iso_forest.predict(processed_df_2[feature_columns])



processed_df_2['Anomaly_Label'] = processed_df_2['Anomaly_Label'].apply(lambda x: 1 if x == -1 else 0)


plt.figure(figsize=(10, 5))
plt.hist(processed_df_2['Anomaly_Score'], bins=50, alpha=0.7, color='blue', edgecolor='black')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.title('Distribution of Anomaly Scores')

plt.show()

plt.clf()

The distribution of scores in the left tail shows that the more anomalous observations have negative scores, roughly between -0.04 and -0.06.
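This sign convention can be sanity-checked on toy data: IsolationForest's `decision_function` returns negative values for points it isolates quickly, and `predict` returns -1 for those same points. A small sketch using a hypothetical 2-D cluster plus one far-away point:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # dense cluster of normal points
               [[10.0, 10.0]]])                   # one obvious anomaly

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=42).fit(X)
scores = iso.decision_function(X)   # negative -> anomalous
labels = iso.predict(X)             # -1 -> anomaly, 1 -> normal

# The far-away point should get a negative score and a -1 label
print(labels[-1], scores[-1])
```

The `contamination` parameter only moves the cutoff between the two labels; the ranking given by the scores is unchanged.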


plt.figure(figsize=(10, 5))
plt.scatter(processed_df_2.index, processed_df_2['Anomaly_Score'], c=processed_df_2['Anomaly_Label'], cmap='coolwarm', alpha=0.6)
plt.xlabel('Observation Index')
plt.ylabel('Anomaly Score')
plt.title('Anomaly Score vs. Observations (Red = Anomalies)')
plt.colorbar(label="Anomaly Label (1 = Anomaly, 0 = Normal)")
plt.show()

plt.clf()

Observations with label 1 (anomaly) have anomaly scores below zero, whereas observations with label 0 (normal) have scores above zero.


with contextlib.redirect_stderr(sys.stdout):
  # Compare fraud cases vs anomaly detection
  fraud_anomalies = processed_df_2.groupby(['ReportedFraud', 'Anomaly_Label']).size().unstack()
  fraud_anomalies.plot(kind='bar', stacked=True, figsize=(8, 5))
  plt.xlabel('Reported Fraud (0 = No, 1 = Yes)')
  plt.ylabel('Count')
  plt.title('Comparison of Reported Fraud vs Anomaly Detection')
  plt.legend(title="Anomaly Label", labels=['Normal', 'Anomaly'])

  plt.show()
  plt.clf()




from sklearn.tree import DecisionTreeClassifier
# Feature importance using Decision Tree as a surrogate model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(processed_df_2[feature_columns], processed_df_2['Anomaly_Label'])
DecisionTreeClassifier(random_state=42)

feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': tree_model.feature_importances_
}).sort_values(by='Importance', ascending=False).round(3)



feat_out_tp5=feature_importance.nlargest(5,"Importance")
values = feat_out_tp5.Importance    
idx = feat_out_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important Features: Anomaly Model (Decision Tree Surrogate)')

plt.ylabel("Features", fontsize=10)

plt.tick_params(axis='x', which='major', labelsize=9)

plt.tick_params(axis='y', labelsize=7,labelrotation=45)

plt.show()


plt.clf()

Feature importance is a score assigned to each feature that reflects how much it contributes to the model’s predictions. However, feature importance does not tell us whether a feature’s contribution pushes the prediction up or down.
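The surrogate-tree idea can be illustrated in isolation: fit a tree to reproduce the anomaly labels and read its impurity-based `feature_importances_`, which, as noted above, carry no sign. A sketch on synthetic data where the label depends on only one feature:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame({'a': rng.normal(size=500), 'b': rng.normal(size=500)})

# Hypothetical anomaly labels driven entirely by feature 'a'
labels = (X['a'].abs() > 2).astype(int)

tree = DecisionTreeClassifier(random_state=42).fit(X, labels)
imp = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)  # 'a' should dominate, 'b' should be near zero
```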


In our inspection of these top features in the visualization section, the distributions for fraud and non-fraud showed no differences. To perform anomaly detection thoroughly, we would include other model algorithms; since the focus of this project is predicting fraud, anomaly detection will be a separate project.



Data Preprocessing



Before model building can start, we’ll need to perform pre-processing. This will entail splitting our data into training, validation, and test sets along with transforming numerical and categorical features into classification friendly formats.



Select features

## The  Target categories: Index(['N', 'Y'], dtype='object'):

We will relocate the ReportedFraud feature to the last column of model_data



col=model_data.pop('ReportedFraud')
model_data.insert(22,'ReportedFraud', col)



Split Dataset



We will separate the data so that the predictor features and the target feature are in their own data frames.




model_data2=model_data2.rename(columns={'ReportedFraud': 'labels'})



The data type of the target feature is categorical. Most machine learning algorithms require numerical data types, so the target feature y will be transformed to a numeric type.
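As a minimal sketch of what LabelEncoder does to a binary target (classes are sorted, so 'N' maps to 0 and 'Y' to 1, making fraud the positive class):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['N', 'Y', 'Y', 'N'])

print(list(le.classes_), list(encoded))  # ['N', 'Y'] [0, 1, 1, 0]
```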





label_encoder=LabelEncoder()


def split_data(data):
  y=data.iloc[:, -1]
  y=pd.DataFrame(y)
  y['labels']=label_encoder.fit_transform(y['labels'])
  y['labels']=y['labels'].astype("category")
  X=data.iloc[:, :-1]
  
  return X, y
  
  







X,y=split_data(model_data2)
## CategoricalDtype(categories=[0, 1], ordered=False)


## Target Feature categories as binary:  Int64Index([0, 1], dtype='int64'):


## Shape of Predictor Features is (28181, 22):


## Shape of Target Feature is (28181, 1):



The X data frame has 28,181 rows and 23 columns. The y data frame has the same number of rows and a single column, the target feature.




## *********** X Structure***********
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28181 entries, 0 to 28835
## Data columns (total 23 columns):
##  #   Column                      Non-Null Count  Dtype   
## ---  ------                      --------------  -----   
##  0   TypeOfIncident              28181 non-null  category
##  1   TypeOfCollission            28181 non-null  category
##  2   SeverityOfIncident          28181 non-null  category
##  3   AuthoritiesContacted        28181 non-null  category
##  4   IncidentState               28181 non-null  category
##  5   NumberOfVehicles            28181 non-null  category
##  6   BodilyInjuries              28181 non-null  category
##  7   Witnesses                   28181 non-null  category
##  8   AmountOfTotalClaim          28181 non-null  int32   
##  9   InsuredAge                  28181 non-null  int32   
##  10  InsuredGender               28181 non-null  category
##  11  CapitalGains                28181 non-null  int32   
##  12  CapitalLoss                 28181 non-null  int32   
##  13  CustomerLoyaltyPeriod       28181 non-null  int32   
##  14  InsurancePolicyState        28181 non-null  category
##  15  Policy_CombinedSingleLimit  28181 non-null  category
##  16  Policy_Deductible           28181 non-null  int32   
##  17  PolicyAnnualPremium         28181 non-null  float64 
##  18  UmbrellaLimit               28181 non-null  int32   
##  19  InsuredRelationship         28181 non-null  category
##  20  coverageIncidentDiff        28181 non-null  float64 
##  21  dayOfWeek                   28181 non-null  category
##  22  IncidentPeriodDay           28181 non-null  category
## dtypes: category(14), float64(2), int32(7)
## memory usage: 1.8 MB


Train/Test sets



We will now split X and y into train and test sets.






X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.25,random_state=42, stratify=y)



## Shape of X Train: (21135, 23)
## Shape of X Test: (7046, 23)
## Shape of y Train: (21135, 1)
## Shape of y Test: (7046, 1)



y_train and y_test will be transformed into one-dimensional arrays using a function.


def transform_to_array(y_train, y_test):
  #transform from data frame to numpy array
  y_train_array=np.array(y_train)
  y_test_array=np.array(y_test)
  #transform to one dimensional array
  y_train_np=np.ravel(y_train_array)
  y_test_np=np.ravel(y_test_array)
  
  return y_train_np, y_test_np




y_train_np, y_test_np=transform_to_array(y_train, y_test)





## Shape of y Train np: (21135,)
## Shape of y Test np: (7046,)



From the above output we see that y train and y test have been transformed into one dimensional numpy arrays.





Transform Categorical and Numerical features



Our next step is to transform the predictor features into acceptable machine learning formats.

Transformation of numerical features is performed by scaling. Scaling prevents a feature with a large range, say in the thousands, from being treated as more important than a feature with a smaller range; it places features on equal footing before they reach a machine learning algorithm. There are different scaling methods; for this analysis we’ll use standard scaling, which transforms the data to have zero mean and a variance of one, making the data unitless.
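Standard scaling computes z = (x − mean) / std per column. A quick sketch, using hypothetical deductible-like values, confirming the zero-mean, unit-variance property:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[500.], [1000.], [1500.], [2000.]])  # made-up deductible-like values
Xs = StandardScaler().fit_transform(X)

print(Xs.ravel())                             # unitless z-scores
print(Xs.mean().round(6), Xs.std().round(6))  # ~0.0 and 1.0
```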

Most machine learning algorithms only accept numerical features which makes categorical features unacceptable in their original form. Thus, we need to encode categorical features into numerical values. The act of replacing categories with numbers is called categorical encoding. For this we will use one-hot encoding. Categorical features are represented as a group of binary features, where each binary feature represents one category. The binary feature takes the integer value 1 if the category is present, or 0 otherwise.



set_config configures pre-processing steps such as StandardScaler and OneHotEncoder to return a pandas DataFrame




set_config(transform_output="pandas")




def define_columns(X_train):
  categorical= list(X_train.select_dtypes('category').columns)
  numerical = list(X_train.select_dtypes('number').columns)
  
  return categorical, numerical




categorical, numerical=define_columns(X_train)


First, we will create a function to transform the train and test sets for the Logistic Regression model. This entails dropping the first category of each feature during one-hot encoding.



def transform_x_columns(X_train, X_test):
  ct_lr=ColumnTransformer(
  transformers=[
   ('scale',StandardScaler(), numerical),
   ('ohe',OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'), categorical)])
  X_train_lr=ct_lr.fit_transform(X_train)
  X_test_lr=ct_lr.transform(X_test)
  
  return X_train_lr, X_test_lr
  
   
  
  



with contextlib.redirect_stderr(sys.stdout):
  X_train_lr, X_test_lr=transform_x_columns(X_train, X_test)





## ********************First Five Rows X_train_lr********************
##        scale__AmountOfTotalClaim  scale__InsuredAge  scale__CapitalGains
## 3409                   -1.852393          -1.614806             1.582364
## 19726                  -0.207519          -0.234876             1.636523
## 9912                    0.171842           1.395950             1.174367
## 5686                   -1.839096           0.266916             1.156314
## 11372                  -1.850651           1.395950             1.022723





## ********************First Five Rows X_test_lr********************
##        scale__AmountOfTotalClaim  scale__InsuredAge  scale__CapitalGains
## 7225                   -0.329962          -1.238462             1.224916
## 15229                  -0.220341          -0.611221             1.062439
## 24504                   0.747730          -0.987565             1.037165
## 14811                   1.059300          -0.485773            -0.836730
## 14954                   0.421042          -0.987565             1.701513



We will confirm that X_train_lr and X_test_lr have the same number of columns after transformation




# Function to check column count
def check_columns_equal(df1, df2):
    assert df1.shape[1] == df2.shape[1], f"Error: Column counts do not match. df1 has {df1.shape[1]} columns, df2 has {df2.shape[1]} columns."
    print("Columns are equal.")


check_columns_equal(X_train_lr, X_test_lr)
## Columns are equal.



We see from the first five rows of the train and test sets that the features have been transformed while retaining the feature column names.



## Shape of X Train lr: (21135, 63)
## Shape of X Test lr: (7046, 63)





Next, we transform the train and test sets for all other models. During one-hot encoding, the first category will be dropped only if the feature is binary.







def transform_x_columns_tr(train, test):
  ct_tr=ColumnTransformer(
  transformers=[
   ('num',StandardScaler(), numerical),
   ('cat',OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'), categorical)])
  train_tr=ct_tr.fit_transform(train)
  test_tr=ct_tr.transform(test)
  
  return train_tr, test_tr




X_train_tr, X_test_tr=transform_x_columns_tr(X_train, X_test)




## ************First Five Rows X_train_tr************
##        num__AmountOfTotalClaim  num__InsuredAge  num__CapitalGains
## 3409                 -1.852393        -1.614806           1.582364
## 19726                -0.207519        -0.234876           1.636523
## 9912                  0.171842         1.395950           1.174367
## 5686                 -1.839096         0.266916           1.156314
## 11372                -1.850651         1.395950           1.022723



## ********************First Five Rows X_test_tr********************
##        num__AmountOfTotalClaim  num__InsuredAge  num__CapitalGains
## 7225                 -0.329962        -1.238462           1.224916
## 15229                -0.220341        -0.611221           1.062439
## 24504                 0.747730        -0.987565           1.037165
## 14811                 1.059300        -0.485773          -0.836730
## 14954                 0.421042        -0.987565           1.701513


We will confirm that X_train_tr and X_test_tr have equal column counts


check_columns_equal(X_train_tr, X_test_tr)
## Columns are equal.



## Shape of X Train tr: (21135, 76)
## Shape of X Test tr: (7046, 76)



From the shape output we find there are 13 additional columns compared to the logistic regression transformed data.







For evaluating model performance, the event of interest is a claim for which reported fraud is “yes”; this is the positive class. Classification metrics are used to determine how well our models predict this event.



Metrics Definitions



Accuracy measures the number of correct predictions as a percentage of the total number of predictions made. For example, if 90% of your predictions are correct, your accuracy is simply 90%. Calculation: correct predictions / total predictions = (TP+TN)/(TP+TN+FP+FN)

Precision tells us about the quality of positive predictions. The model may not find all the positives, but the ones it does classify as positive are very likely to be correct. For example, out of all claims predicted to be fraudulent, how many actually were? So, within everything predicted positive, precision counts the percentage that is correct. Calculation: TP/(TP+FP)

Recall tells us how well the model identifies true positives. The model may find many positives, yet also wrongly flag many that are not actually positive. Out of all the truly fraudulent claims, how many were correctly identified? A model with low recall fails to find all (or a large part) of the positive cases in the data. Calculation: TP/(TP+FN)

The F1 score is defined as the harmonic mean of precision and recall.

The harmonic mean is an alternative to the more common arithmetic mean and is often useful when averaging rates. https://en.wikipedia.org/wiki/Harmonic_mean

The formula for the F1 score is: 2 * (Precision * Recall) / (Precision + Recall)

Since the F1 score is the harmonic mean of precision and recall, it gives equal weight to both.
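These definitions can be checked on a tiny hand-computable example: 8 predictions with TP=3, TN=3, FP=1, FN=1, so all four metrics work out to 0.75:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Confusion counts: TP=3, FN=1, FP=1, TN=3
print(accuracy_score(y_true, y_pred))    # (3+3)/8 = 0.75
print(precision_score(y_true, y_pred))   # 3/(3+1) = 0.75
print(recall_score(y_true, y_pred))      # 3/(3+1) = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of 0.75 and 0.75 = 0.75
```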





recall_scorer = make_scorer(recall_score, pos_label=1)

precision_scorer = make_scorer(precision_score, pos_label=1)

# AUC should be computed from continuous scores rather than hard labels
# (on scikit-learn >= 1.4, response_method replaces needs_threshold)
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)



Model Training



Logistic Regression

Base Model





skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

lr_base_clf=logreg.fit(X_train_lr, y_train_np)


start_time = time.time()

lr_base_cv_accuracy=cross_val_score(lr_base_clf, X_train_lr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
log_CrossValAccurBase_time = time.time() - start_time

start_time = time.time()
lr_base_cv_recall_score=cross_val_score(lr_base_clf, X_train_lr, y_train_np, 
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValRecallBase_time = time.time() - start_time


start_time = time.time()
lr_base_cv_precision_score=cross_val_score(lr_base_clf, X_train_lr, y_train_np, 
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValPrecBase_time = time.time() - start_time


start_time = time.time()
lr_base_cv_auc_score=cross_val_score(lr_base_clf, X_train_lr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValAUCBase_time = time.time() - start_time


start_time = time.time()
lr_base_cv_f1=cross_val_score(lr_base_clf, X_train_lr, y_train_np, cv=skf, scoring='f1').mean().round(2)
log_CrossValF1Base_time = time.time() - start_time



Cross Validation with Parameters




lr=LogisticRegression(random_state=1)


lr_params={
'C': [0.0001,0.001, 0.01, 0.1, 1, 10], 
'penalty': ['l2'],
'max_iter': list(range(5000,40000, 5000)),
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

}

lr_search=RandomizedSearchCV(lr, lr_params, refit=True, 
verbose=3,cv=5,n_iter=6,scoring='roc_auc',return_train_score=True, n_jobs=-1)
start_time = time.time()
lr_search.fit(X_train_lr, y_train_np)
RandomizedSearchCV(cv=5, estimator=LogisticRegression(random_state=1), n_iter=6,
                   n_jobs=-1,
                   param_distributions={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
                                        'max_iter': [5000, 10000, 15000, 20000,
                                                     25000, 30000, 35000],
                                        'penalty': ['l2'],
                                        'solver': ['newton-cg', 'lbfgs',
                                                   'liblinear', 'sag',
                                                   'saga']},
                   return_train_score=True, scoring='roc_auc', verbose=3)
log_grid_training_time = time.time() - start_time


lr_cv_results=pd.DataFrame(lr_search.cv_results_)
lr_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score    0.769316
## std_train_score     0.001226
## mean_test_score     0.765394
## std_test_score      0.004892
## dtype: float64


The mean train and test scores from cross validation are close, indicating neither overfitting nor underfitting.







lr_clf=lr_search.best_estimator_
LogisticRegression(C=1, max_iter=30000, random_state=1, solver='newton-cg')



The above display shows the parameters chosen for the logistic regression model.





start_time = time.time()
lr_cv_accuracy=cross_val_score(lr_clf, X_train_lr, y_train_np, 
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
log_CrossValAccur_time = time.time() - start_time

start_time = time.time()
lr_cv_f1_score=cross_val_score(lr_clf, X_train_lr, y_train_np, 
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)

log_CrossValF1_time = time.time() - start_time

start_time = time.time()

lr_cv_recall_score=cross_val_score(lr_clf, X_train_lr, y_train_np, 
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValRecall_time = time.time() - start_time

start_time = time.time()
lr_cv_precision_score=cross_val_score(lr_clf, X_train_lr, y_train_np, 
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValPrec_time = time.time() - start_time

start_time = time.time()
lr_cv_auc_score=cross_val_score(lr_clf, X_train_lr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValAuc_time = time.time() - start_time




log_cross_val_Time=(log_CrossValAccur_time+log_CrossValRecall_time+log_CrossValF1_time+ log_CrossValPrec_time+log_CrossValAuc_time)/5

Metrics and Feature Importance



y_pred_lr=lr_clf.predict(X_test_lr)  # generate test-set predictions
cm_lr = metrics.confusion_matrix(y_test_np, y_pred_lr, labels=[0,1])
df_cm_lr = pd.DataFrame(cm_lr, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_lr.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_lr.flatten()/np.sum(cm_lr)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,6))
sns.heatmap(df_cm_lr, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Logistic Regression", fontsize=14)

plt.show()

plt.clf()

The confusion matrix plot displays the performance of a classifier. Accurate fraud predictions of Yes (True Positives) are located at the bottom-right of the matrix. Inaccurate fraud predictions of Yes (False Positives) are located at the top-right. Accurate fraud predictions of No (True Negatives) are located at the top-left. Inaccurate fraud predictions of No (False Negatives) are located at the bottom-left.


We see from the confusion matrix that 13.84% of claims were accurately predicted as fraud (Yes), compared to 7.41% that were inaccurately predicted as Yes.
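To make the four quadrants concrete, here is a small sketch using a hypothetical 2x2 matrix (the counts are invented, not taken from this dataset):

```python
# Hypothetical confusion matrix, laid out like the plots in this report:
# rows = actual (No, Yes), columns = predicted (No, Yes).
cm = [[50, 10],   # TN = 50 (top-left),    FP = 10 (top-right)
      [15, 25]]   # FN = 15 (bottom-left), TP = 25 (bottom-right)

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)                  # 25/35 ~ 0.71
recall = tp / (tp + fn)                     # 25/40 = 0.625
accuracy = (tp + tn) / (tn + fp + fn + tp)  # 75/100 = 0.75
```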





We will now look at Feature Importance. Feature importance is a score assigned to each feature of a machine learning model that indicates how much that feature contributes to the model’s predictions. However, feature importance does not tell us whether the contribution pushes the prediction in a positive or negative direction.






feature_importance_lr=pd.DataFrame({'feature':list(X_test_lr.columns),'feature_importance':[abs(i) for i in lr_clf.coef_[0]]})


feature_importance_lr=feature_importance_lr.sort_values('feature_importance',ascending=False)

For the logistic regression model we took the absolute value of the coefficients so that features with large negative effects rank as high as those with large positive effects.



Now that we have the importance of the features, we will now transform the coefficients for easier interpretation. The coefficients are in log odds format. We will transform them to odds-ratio format.




#Combine feature names and coefficients into one Pandas DataFrame
feature_names_lr=pd.DataFrame(X_test_lr.columns, columns=['Feature'])

log_coef=pd.DataFrame(np.transpose(lr_clf.coef_), columns=['Coefficient'])

coefficients=pd.concat([feature_names_lr, log_coef], axis=1)

#Exponentiate the logistic regression coefficients to convert log odds to odds ratios

coefficients['Exp_Coefficient']=np.exp(coefficients['Coefficient'])
#Keep features whose odds ratio is at least 1 (non-negative coefficients)

coefficients=coefficients[coefficients['Exp_Coefficient']>=1]


coefficients_tp5=coefficients.nlargest(5,"Exp_Coefficient")



## ******************Top Five Coefficients******************
##                                     Feature  Exp_Coefficient
## 50       ohe__InsuredRelationship_unmarried         1.608058
## 34                         ohe__Witnesses_2         1.557392
## 48  ohe__InsuredRelationship_other-relative         1.528998
## 47   ohe__InsuredRelationship_not-in-family         1.487331
## 58     ohe__IncidentPeriodDay_early morning         1.410756



Three levels of Insured Relationship are in the top five, along with Witnesses=2 and incident period of day=early morning.
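To illustrate how the odds-ratio transformation above is read, here is a minimal sketch (the log-odds value is hypothetical, not a coefficient from this model):

```python
import math

# A hypothetical log-odds coefficient from a logistic regression.
log_odds = 0.475

# Exponentiating converts log odds to an odds ratio: a one-unit increase in
# the feature multiplies the odds of fraud by roughly 1.61.
odds_ratio = math.exp(log_odds)

# A coefficient of 0 maps to an odds ratio of exactly 1 (no effect), which is
# why the filter above keeps only features with Exp_Coefficient >= 1.
no_effect = math.exp(0.0)
```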



Support Vector Machine



A Support Vector Machine is responsible for finding the decision boundary that separates the classes while maximizing the margin between them. A decision boundary differentiates two classes: a data point is assigned to one class or the other depending on which side of the boundary it falls. For binary classes, this would be either yes or no.
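For a linear kernel, "which side of the boundary" reduces to the sign of w·x + b. A minimal sketch (the weights, intercept, and points are all made up for illustration):

```python
# Minimal sketch of a linear decision boundary w.x + b = 0. A point's class
# depends on which side of the boundary it falls; the SVM searches for the
# boundary that maximizes the margin. All values here are illustrative.
w = [2.0, -1.0]   # weight vector (normal to the boundary)
b = -1.0          # intercept

def classify(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

side_a = classify([2.0, 1.0])   # score = 4 - 1 - 1 =  2 -> class 1
side_b = classify([0.0, 1.0])   # score = 0 - 1 - 1 = -2 -> class 0
```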





from sklearn.svm import SVC

Base Model




svc=SVC(random_state=1, kernel="rbf")


svc_base_clf=svc.fit(X_train_lr, y_train_np)

start_time = time.time()
svc_base_cv_accuracy=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
svc_CrossValAccurBase_time = time.time() - start_time


start_time = time.time()
svc_base_cv_recall=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring=recall_scorer, n_jobs=-1).mean().round(2)
svcBase_CrossValRcall = time.time() - start_time


start_time = time.time()
svc_base_cv_precision=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring=precision_scorer, n_jobs=-1).mean().round(2)
svcBase_CrossValprec = time.time() - start_time


start_time = time.time()
svc_base_cv_f1=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring='f1', n_jobs=-1).mean().round(2)
svcBase_CrossValF1 = time.time() - start_time


start_time = time.time()
svc_base_cv_auc_score=cross_val_score(svc_base_clf, X_train_lr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
svcBase_CrossValAuc = time.time() - start_time


svcBase_cross_val_Time=(svc_CrossValAccurBase_time+svcBase_CrossValRcall+svcBase_CrossValF1+ svcBase_CrossValprec+svcBase_CrossValAuc)/5



Cross Validation with Parameters




param_grid_svc = {'C': [0.0001,.001,.01,1, 10, 100], 'gamma': [1,0.1,0.01,0.001, .0001]}


grid_svc=RandomizedSearchCV(svc,param_grid_svc, refit=True, 
verbose=3,cv=5,n_iter=6, scoring='roc_auc',return_train_score=True, n_jobs=-1)
start_time = time.time()
grid_svc.fit(X_train_lr, y_train_np)
RandomizedSearchCV(cv=5, estimator=SVC(random_state=1), n_iter=6, n_jobs=-1,
                   param_distributions={'C': [0.0001, 0.001, 0.01, 1, 10, 100],
                                        'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
                   return_train_score=True, scoring='roc_auc', verbose=3)
svc_grid_training_time = time.time() - start_time



svc_cv_results=pd.DataFrame(grid_svc.cv_results_)
svc_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score    0.866267
## std_train_score     0.002293
## mean_test_score     0.824598
## std_test_score      0.004312
## dtype: float64



The cross validation results show the mean train score is about .04 higher than the mean test score, which may indicate slight overfitting. Applying the classifier to the test data will help clarify this.






svc_clf=grid_svc.best_estimator_
SVC(C=10, gamma=0.1, random_state=1)



The above display shows the parameters chosen for the support vector machine model.





start_time = time.time()
svc_cv_accuracy=cross_val_score(svc_clf, X_train_lr, y_train_np, 
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValAccur_time = time.time() - start_time



start_time = time.time()
svc_cv_f1_score=cross_val_score(svc_clf, X_train_lr, y_train_np, 
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValF1_time = time.time() - start_time

start_time = time.time()
svc_cv_recall_score=cross_val_score(svc_clf, X_train_lr, y_train_np, 
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValRecall_time = time.time() - start_time


start_time = time.time()
svc_cv_precision=cross_val_score(svc_clf, X_train_lr, y_train_np, 
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValPrec_time = time.time() - start_time


start_time = time.time()
svc_cv_auc=cross_val_score(svc_clf, X_train_lr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValAuc_time = time.time() - start_time



Test Set






y_predBase_svc=svc_base_clf.predict(X_test_lr)
print('******SVC  Classification Report******')
## ******SVC  Classification Report******
print(classification_report(y_test_np, y_predBase_svc))
##               precision    recall  f1-score   support
## 
##            0       0.91      0.97      0.94      5137
##            1       0.90      0.74      0.81      1909
## 
##     accuracy                           0.91      7046
##    macro avg       0.90      0.85      0.87      7046
## weighted avg       0.91      0.91      0.90      7046




svc_AccuracyBase_test=roc_auc_score(y_test_np, y_predBase_svc).round(2)
cm_svc = metrics.confusion_matrix(y_test_np, y_predBase_svc, labels=[0,1])
df_cm_svc = pd.DataFrame(cm_svc, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_svc.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_svc.flatten()/np.sum(cm_svc)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,6))
sns.heatmap(df_cm_svc, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Support Vector Machine", fontsize=14)

plt.show()

plt.clf()

Our SVC classifier performed better at accurately predicting fraud of Yes (20%) than the logistic regression classifier (13.84%). Additionally, the SVC classifier’s inaccurate fraud predictions of Yes were only 2.28%, compared to the logistic regression model’s 7.41%.

Random Forest



Random forest is an ensemble learning method. Ensemble learning takes predictions from multiple models and merges them to enhance prediction accuracy. There are four types of ensemble techniques. We’ll be using bagging (of which random forest is an example) and boosting, which the remaining models are examples of.

Bagging involves fitting many decision trees on different samples of the same dataset and averaging the predictions.

Random Forest models are made up of individual decision trees whose predictions are combined for a final result. The final result is decided by majority vote, meaning the final prediction is whatever the majority of the decision trees chose. An example would be 5 trees in which 3 of the 5 predict ‘yes’ for the classification problem.

Random Forests can be made up of thousands of decision trees.

Simply put, the random forest builds multiple decision trees and merges them together to get a more accurate prediction.

Random Forest for Beginners
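The majority-vote step described above can be sketched in a few lines (the tree “votes” below are made up for illustration):

```python
from collections import Counter

# Five hypothetical trees each cast a class prediction; the forest returns
# the class chosen by the majority of trees.
tree_votes = ['yes', 'no', 'yes', 'yes', 'no']

forest_prediction = Counter(tree_votes).most_common(1)[0][0]   # 'yes' (3 of 5)
```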





from sklearn.ensemble import RandomForestClassifier



Base Model




rf=RandomForestClassifier(random_state=1,n_jobs=-1)


rf_base_clf=rf.fit(X_train_tr, y_train_np)

start_time = time.time()
rf_base_cv_accuracy=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
rfBase_CrossValAccur = time.time() - start_time


start_time = time.time()
rf_base_cv_recall=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='recall').mean().round(2)
rfBase_CrossValRecall = time.time() - start_time

start_time = time.time()
rf_base_cv_precision=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='precision').mean().round(2)
rfBase_CrossValPrec = time.time() - start_time

start_time = time.time()
rf_base_cv_f1=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='f1').mean().round(2)
rfBase_CrossValF1 = time.time() - start_time

start_time = time.time()
rf_base_cv_auc=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring=roc_auc_scorer).mean().round(2)
rfBase_CrossValAuc = time.time() - start_time



Cross Validation with Parameters







rf_params={'n_estimators':[500,1000,22500,5000],         
'max_features':[0.25,0.50,0.75,1.0],
'min_samples_split':[2,4,6,8], 
#'max_depth': [500, 1000, 2000, 4000,6000], 
'max_depth': list(range(500,15000,500)),
'min_samples_leaf': [3, 4, 5, 6],
'criterion': ['gini', 'entropy', 'log_loss']}



rf_search=RandomizedSearchCV(rf, rf_params, n_iter=6,refit=True, 
verbose=3,cv=5, scoring='roc_auc',return_train_score=True, n_jobs=-1)
start_time = time.time()
rf_search.fit(X_train_tr, y_train_np)
RandomizedSearchCV(cv=5,
                   estimator=RandomForestClassifier(n_jobs=-1, random_state=1),
                   n_iter=6, n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy',
                                                      'log_loss'],
                                        'max_depth': [500, 1000, 1500, 2000,
                                                      2500, 3000, 3500, 4000,
                                                      4500, 5000, 5500, 6000,
                                                      6500, 7000, 7500, 8000,
                                                      8500, 9000, 9500, 10000,
                                                      10500, 11000, 11500,
                                                      12000, 12500, 13000,
                                                      13500, 14000, 14500],
                                        'max_features': [0.25, 0.5, 0.75, 1.0],
                                        'min_samples_leaf': [3, 4, 5, 6],
                                        'min_samples_split': [2, 4, 6, 8],
                                        'n_estimators': [500, 1000, 22500,
                                                         5000]},
                   return_train_score=True, scoring='roc_auc', verbose=3)
rf_grid_training_time = time.time() - start_time


rf_cv_results=pd.DataFrame(rf_search.cv_results_)
rf_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score    0.993862
## std_train_score     0.000114
## mean_test_score     0.905631
## std_test_score      0.005357
## dtype: float64


The cross validation mean train score is about .09 higher than the mean test score, which could be a case of overfitting. Applying the classifier to the test data will provide more metrics to help us.



rf_clf=rf_search.best_estimator_
rf_clf
RandomForestClassifier(criterion='entropy', max_depth=14500, max_features=0.5,
                       min_samples_leaf=3, min_samples_split=6,
                       n_estimators=500, n_jobs=-1, random_state=1)

The above display presents the parameters chosen for the Random Forest classifier.


start_time = time.time()
rf_cv_accuracy=cross_val_score(rf_clf, X_train_tr, y_train_np, 
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
rf_CrossAccur_time = time.time() - start_time

start_time = time.time()
rf_cv_recall=cross_val_score(rf_clf, X_train_tr, y_train_np, 
scoring=recall_scorer,cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValRecall_time = time.time() - start_time

start_time = time.time()
rf_cv_precision=cross_val_score(rf_clf, X_train_tr, y_train_np, 
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValPrec_time = time.time() - start_time

start_time = time.time()
rf_cv_f1=cross_val_score(rf_clf, X_train_tr, y_train_np, 
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValF1_time = time.time() - start_time

start_time = time.time()
rf_cv_auc=cross_val_score(rf_clf, X_train_tr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValAuc_time = time.time() - start_time



Test Set


We’ll now use the classifier to predict on the test data.





y_pred_Base_rf=rf_base_clf.predict(X_test_tr)



print('*******Random Forest Classification Report********')
## *******Random Forest Classification Report********
print(classification_report(y_test_np, y_pred_Base_rf))
##               precision    recall  f1-score   support
## 
##            0       0.91      0.97      0.94      5137
##            1       0.91      0.73      0.81      1909
## 
##     accuracy                           0.91      7046
##    macro avg       0.91      0.85      0.87      7046
## weighted avg       0.91      0.91      0.90      7046





rfBase_recall_test=recall_score(y_test_np, y_pred_Base_rf, pos_label=1).round(2)




rfBase_roc_test=roc_auc_score(y_test_np, y_pred_Base_rf).round(2)


rfBase_precision_test=precision_score(y_test_np, y_pred_Base_rf, pos_label=1).round(2)


rfBase_test_accuracy=accuracy_score(y_test_np, y_pred_Base_rf).round(2)


rfBase_test_f1=f1_score(y_test_np, y_pred_Base_rf).round(2)



cm_rf_vl = metrics.confusion_matrix(y_test_np, y_pred_Base_rf, labels=[0,1])
df_cm_rf_vl = pd.DataFrame(cm_rf_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_rf_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_rf_vl.flatten()/np.sum(cm_rf_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,9))
sns.heatmap(df_cm_rf_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Random Forest", fontsize=14)

plt.show()

plt.clf()

The performance of our random forest classifier at accurately predicting fraud of Yes is 19.66%, only slightly less than the SVC’s 20%. Random forest did have a slightly lower percentage of inaccurately predicting fraud of Yes (2.06%).



Feature Importance





# let's create a dictionary of features and their importance values
feat_dict_rf= {}
for col, val in sorted(zip(X_train_tr.columns, rf_base_clf.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict_rf[col]=val



feat_rf_df = pd.DataFrame({'Feature':feat_dict_rf.keys(),'Importance':feat_dict_rf.values()})


feat_rf_tp5=feat_rf_df.nlargest(5,"Importance")
values = feat_rf_tp5.Importance    
idx = feat_rf_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features Random Forest Model')

plt.ylabel("Features", fontsize=10)

plt.tick_params(axis='x', which='major', labelsize=9)

plt.tick_params(axis='y', labelsize=7,labelrotation=42)

plt.show()


plt.clf()

Of the top features for our random forest model, it’s interesting to note that two through five are also important features of our anomaly detection model.



Gradient Boosting



Gradient boosting also uses incorrect predictions from previous trees to adjust the next tree, though this is accomplished by fitting each new tree to the errors of the previous trees’ predictions. Mistakes from the previous trees are used to build a new tree solely around those mistakes. As mentioned earlier with AdaBoost, boosting takes these weak learners and combines them into a strong learner. The difference is that the gradient boosting algorithm fits each new tree to the errors of the previous tree, in contrast to AdaBoost.

The main idea behind this algorithm is to build models sequentially and these subsequent models try to reduce the errors of the previous model. Errors are reduced by building a new model on the errors or residuals of the previous model.
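One sequential step of this idea can be sketched numerically. The targets, the learning rate, and the assumption that the new tree predicts each residual exactly are all illustrative:

```python
# Toy sketch of a single boosting step: start from a constant prediction,
# fit the residuals, and add a damped correction.
y = [3.0, 5.0, 9.0, 11.0]

f0 = sum(y) / len(y)                 # initial model: the mean, 7.0
residuals = [yi - f0 for yi in y]    # errors the next "tree" targets

learning_rate = 0.5
# Pretend the next tree predicts each residual exactly, then blend it in:
f1_preds = [f0 + learning_rate * r for r in residuals]

def sse(preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

before = sse([f0] * len(y))   # 40.0
after = sse(f1_preds)         # 10.0 -- the update shrinks the error
```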









from sklearn.ensemble import GradientBoostingClassifier

Base Model






gb=GradientBoostingClassifier(warm_start=True)

gb_base_clf=gb.fit(X_train_tr, y_train_np)


start_time = time.time()
gb_base_cv_accuracy=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
gbBase_CrossValAccur = time.time() - start_time

start_time = time.time()
gb_base_cv_recall=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=recall_scorer).mean().round(2)
gbBase_CrossValRecall= time.time() - start_time

start_time = time.time()
gb_base_cv_precision=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=precision_scorer).mean().round(2)
gbBase_CrossValPrec = time.time() - start_time
start_time = time.time()
gb_base_cv_f1=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='f1').mean().round(2)
gbBase_CrossValF1 = time.time() - start_time

start_time = time.time()
gb_base_cv_auc=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=roc_auc_scorer).mean().round(2)
gbBase_CrossValAuc = time.time() - start_time



Cross Validation with Parameters





gb_params={
  'subsample':[0.4, 0.6, 0.7, 0.75],
  'n_estimators':np.arange(500, 10000, 500),
  'learning_rate':[0.0001, 0.001,.01,0.05, 0.075,0.1],
  'max_features':range(6,20,2),
  'min_samples_split':range(1000,2200,200),
  'min_samples_leaf':range(30,70,10),
  'max_depth':range(4,16,2),
  
}


search_cv_gb=RandomizedSearchCV(estimator=gb,param_distributions=gb_params,n_iter=6, scoring='roc_auc', cv=5, verbose=1, refit=True,return_train_score=True, n_jobs=-1, random_state=2)
start_time = time.time()
search_cv_gb.fit(X_train_tr, y_train_np)
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(warm_start=True),
                   n_iter=6, n_jobs=-1,
                   param_distributions={'learning_rate': [0.0001, 0.001, 0.01,
                                                          0.05, 0.075, 0.1],
                                        'max_depth': range(4, 16, 2),
                                        'max_features': range(6, 20, 2),
                                        'min_samples_leaf': range(30, 70, 10),
                                        'min_samples_split': range(1000, 2200, 200),
                                        'n_estimators': array([ 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500,
       6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500]),
                                        'subsample': [0.4, 0.6, 0.7, 0.75]},
                   random_state=2, return_train_score=True, scoring='roc_auc',
                   verbose=1)

gb_grid_training_time = time.time() - start_time


gb_cv_results=pd.DataFrame(search_cv_gb.cv_results_)


gb_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score    0.933659
## std_train_score     0.000696
## mean_test_score     0.881175
## std_test_score      0.006381
## dtype: float64


The cross validation mean train score is about .05 higher than the mean test score, which could be a case of overfitting. Applying the classifier to the test data will provide more metrics to help us.





gb_clf=search_cv_gb.best_estimator_
gb_clf
GradientBoostingClassifier(max_depth=12, max_features=10, min_samples_leaf=30,
                           min_samples_split=1000, n_estimators=3000,
                           subsample=0.7, warm_start=True)



The above display presents the parameters chosen for the gradient boost model.




start_time = time.time()
gb_cv_f1_score=cross_val_score(gb_clf, X_train_tr, y_train_np, 
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValF1_time = time.time() - start_time

start_time = time.time()
gb_cv_accuracy=cross_val_score(gb_clf, X_train_tr, y_train_np, 
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValAccur_time = time.time() - start_time

start_time = time.time()
gb_cv_recall=cross_val_score(gb_clf, X_train_tr, y_train_np, 
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValRecall_time = time.time() - start_time

start_time = time.time()
gb_cv_precision=cross_val_score(gb_clf, X_train_tr, y_train_np, 
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValPrec_time = time.time() - start_time


start_time = time.time()
gb_cv_auc=cross_val_score(gb_clf, X_train_tr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValAuc_time = time.time() - start_time



Metrics and Feature Importance







y_pred_gb=gb_clf.predict(X_test_tr)  # generate test-set predictions

cm_gb_vl = metrics.confusion_matrix(y_test_np, y_pred_gb, labels=[0,1])
df_cm_gb_vl = pd.DataFrame(cm_gb_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_gb_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_gb_vl.flatten()/np.sum(cm_gb_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,9))
sns.heatmap(df_cm_gb_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')


plt.title("Confusion Matrix-Gradient Boost", fontsize=14)

plt.show()


plt.clf()

The performance of our gradient boosting classifier at accurately predicting fraud of Yes is 21.05%, just above the random forest’s 19.66% and the SVC’s 20%. However, gradient boosting’s inaccurate predictions of fraud of Yes came to 3.18%, about 1% more than the other two classifiers.





#  create a dictionary of features and their importance values
feat_dict_gb = {}
for col, val in sorted(zip(X_train_tr.columns,gb_clf.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict_gb[col]=val


feat_gb_df= pd.DataFrame({'Feature':feat_dict_gb.keys(),'Importance':feat_dict_gb.values()})

feat_gb_tp5=feat_gb_df.nlargest(5,"Importance")
values = feat_gb_tp5.Importance    
idx = feat_gb_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features  Gradient Boosting Model')

plt.ylabel("Features", fontsize=8)

plt.tick_params(axis='x', which='major', labelsize=8)

plt.tick_params(axis='y', labelsize=7, labelrotation=42)

plt.show()


plt.clf()

The top features of our gradient boosting model are the same as those from the random forest model.



Extreme Gradient Boosting



Extreme Gradient Boosting is similar to gradient boosting with a few improvements. First, engineering enhancements make it faster than other ensemble methods. Second, built-in regularization gives it an advantage in accuracy. Regularization is the process of adding information to reduce variance and prevent overfitting.
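The regularization idea can be sketched as an L2 penalty added to the raw loss. The weights, loss value, and lambda below are invented for illustration (XGBoost’s actual objective also supports an L1 term):

```python
# Minimal sketch: adding an L2 penalty on the model's weights to the raw
# loss makes large weights expensive, which reduces variance/overfitting.
weights = [0.5, -1.2, 2.0]
raw_loss = 0.30
reg_lambda = 0.1    # plays the role of XGBoost's reg_lambda hyperparameter

l2_penalty = reg_lambda * sum(w ** 2 for w in weights)   # 0.1 * 5.69 = 0.569
regularized_loss = raw_loss + l2_penalty                 # 0.869
```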






from xgboost import XGBClassifier



Base Model




xgb=XGBClassifier(booster='gbtree',objective='binary:logistic', n_jobs=-1)


xgb_base_clf=xgb.fit(X_train_tr, y_train_np)
start_time = time.time()


xgb_base_cv_accuracy=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
xgbBase_CrossValAccur = time.time() - start_time

start_time = time.time()
xgb_base_cv_recall=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=recall_scorer).mean().round(2)
xgbBase_CrossValRecall = time.time() - start_time

start_time = time.time()
xgb_base_cv_precision=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=precision_scorer).mean().round(2)
xgbBase_CrossValPrec = time.time() - start_time

start_time = time.time()
xgb_base_cv_f1=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='f1').mean().round(2)
xgbBase_CrossValF1 = time.time() - start_time

start_time = time.time()
xgb_base_cv_auc=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=roc_auc_scorer).mean().round(2)
xgbBase_CrossValAuc = time.time() - start_time



Cross Validation with Parameters



params_xg={
    "learning_rate": [0.01, 0.05, 0.10, 0.20,0.25,0.4, 0.5],
    "max_depth": range(2,10,2),
    "min_child_weight": [1,3,5,7],
    "gamma": [0.0,0.01,0.05,0.1,0.5,1,2,3],
    "colsample_bytree": [0.5,0.6,0.7,0.8,0.9,1],
    "colsample_bynode": [0.5,0.6,0.7,0.8,0.9,1],
    "colsample_bylevel": [0.5,0.6,0.7,0.8,0.9,1],
    "n_estimators":np.arange(500, 4000, 500),
    'subsample': [0.5,0.6,0.7,0.8,0.9,1]
    
    
  
  
}


search_xg=RandomizedSearchCV(estimator=xgb,
param_distributions=params_xg,n_iter=6, scoring='roc_auc', cv=5, verbose=3, refit=True,return_train_score=True, n_jobs=-1)



start_time = time.time()
search_xg.fit(X_train_tr, y_train_np)
RandomizedSearchCV(cv=5,
                   estimator=XGBClassifier(base_score=None, booster='gbtree',
                                           callbacks=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric=None, feature_types=None,
                                           gamma=None, gpu_id=None,
                                           grow_policy=None,
                                           importance_type=None,
                                           interaction_constraints=None,
                                           learning_...
                                        'colsample_bynode': [0.5, 0.6, 0.7, 0.8,
                                                             0.9, 1],
                                        'colsample_bytree': [0.5, 0.6, 0.7, 0.8,
                                                             0.9, 1],
                                        'gamma': [0.0, 0.01, 0.05, 0.1, 0.5, 1,
                                                  2, 3],
                                        'learning_rate': [0.01, 0.05, 0.1, 0.2,
                                                          0.25, 0.4, 0.5],
                                        'max_depth': range(2, 10, 2),
                                        'min_child_weight': [1, 3, 5, 7],
                                        'n_estimators': array([ 500, 1000, 1500, 2000, 2500, 3000, 3500]),
                                        'subsample': [0.5, 0.6, 0.7, 0.8, 0.9,
                                                      1]},
                   return_train_score=True, scoring='roc_auc', verbose=3)

xgb_grid_training_time = time.time() - start_time



xgb_cv_results=pd.DataFrame(search_xg.cv_results_)


xgb_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score    0.984471
## std_train_score     0.000194
## mean_test_score     0.902756
## std_test_score      0.006329
## dtype: float64


The mean cross-validation score on the training folds is roughly 0.08 higher than the mean validation score. This could be a case of overfitting. Applying the classifier to the test data will provide additional metrics to help us judge.
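The size of that train-versus-validation gap can be checked directly from the `cv_results_` means. A minimal sketch; the 0.05 threshold is an assumed rule of thumb, not a value from this analysis:

```python
# Flag a possible overfit when the mean CV training score sits well above
# the mean CV validation score. The 0.05 threshold is an assumption.
def overfit_gap(mean_train_score, mean_test_score, threshold=0.05):
    gap = mean_train_score - mean_test_score
    return gap, gap > threshold

# Roughly the means reported above:
gap, flagged = overfit_gap(0.984, 0.903)
print(round(gap, 3), flagged)  # 0.081 True
```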





xg_clf=search_xg.best_estimator_





XGBClassifier(base_score=None, booster='gbtree', callbacks=None,
              colsample_bylevel=0.7, colsample_bynode=1, colsample_bytree=0.6,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0.01, gpu_id=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=8, max_leaves=None,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=1500, n_jobs=-1, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

start_time = time.time()
xg_cv_accuracy=cross_val_score(xg_clf, X_train_tr, y_train_np, 
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)

xgb_CrossValAccur_time = time.time() - start_time

start_time = time.time()
xg_cv_recall=cross_val_score(xg_clf, X_train_tr, y_train_np, 
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValRecall_time = time.time() - start_time


start_time = time.time()
xg_cv_precision=cross_val_score(xg_clf, X_train_tr, y_train_np, 
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValPrec_time = time.time() - start_time



start_time = time.time()
xg_cv_auc=cross_val_score(xg_clf, X_train_tr, y_train_np, 
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValAuc_time = time.time() - start_time

start_time = time.time()
xg_cv_f1=cross_val_score(xg_clf, X_train_tr, y_train_np, 
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValF1_time = time.time() - start_time



Test Set






y_test_pred_base_xg=xgb_base_clf.predict(X_test_tr)


## *******Extreme Gradient Boost Classification Report********
##               precision    recall  f1-score   support
## 
##            0       0.91      0.97      0.94      5137
##            1       0.89      0.74      0.81      1909
## 
##     accuracy                           0.91      7046
##    macro avg       0.90      0.86      0.88      7046
## weighted avg       0.91      0.91      0.90      7046






xg_recall_test_base=recall_score(y_test_np, y_test_pred_base_xg).round(2)



xg_roc_test_base=roc_auc_score(y_test_np, y_test_pred_base_xg).round(2)


xg_test_accuracy_base=accuracy_score(y_test_np, y_test_pred_base_xg).round(2)

xg_test_precision_base=precision_score(y_test_np, y_test_pred_base_xg).round(2)

xg_test_f1_base=f1_score(y_test_np, y_test_pred_base_xg).round(2)



cm_xg_vl = metrics.confusion_matrix(y_test_np, y_test_pred_base_xg, labels=[0,1])
df_cm_xg_vl = pd.DataFrame(cm_xg_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_xg_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_xg_vl.flatten()/np.sum(cm_xg_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,9))
sns.heatmap(df_cm_xg_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')


plt.title("Confusion Matrix-Extreme Gradient Boost", fontsize=14)

plt.show()


plt.clf()

The extreme gradient boosting classifier’s correct and incorrect predictions of fraud (Yes), at 20.18% and 2.41% respectively, are close to those of both the random forest and SVC classifiers.



Feature Importance






# let's create a dictionary of features and their importance values
feat_dict_xg= {}
for col, val in sorted(zip(X_train_tr.columns,xgb_base_clf.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict_xg[col]=val


feat_xg_df = pd.DataFrame({'Feature':feat_dict_xg.keys(),'Importance':feat_dict_xg.values()})


feat_xg_tp5=feat_xg_df.nlargest(5,"Importance")
values = feat_xg_tp5.Importance    
idx = feat_xg_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features Extreme Gradient Boost Model')

plt.ylabel("Features", fontsize=9)

plt.tick_params(axis='x', which='major', labelsize=8)

plt.tick_params(axis='y', labelsize=7, labelrotation=42)

plt.show()


plt.clf()

Extreme Gradient Boost’s highest-scoring important feature, the incident severity level Major Damage, is the same as gradient boost’s. The difference is that its score of 20% dominates, while the other important features’ scores hover around 2%.



Model Comparison



metric_comparison=pd.DataFrame({'Model':['Logistic Regression','Support Vector Machine', 'Random Forest', 'Gradient Boosting', 'Extreme Gradient Boosting'],
'RecallBase':[lr_base_cv_recall_score, svc_base_cv_recall,rf_base_cv_recall, gb_base_cv_recall, xgb_base_cv_recall],
'RecallTune':[lr_cv_recall_score, svc_cv_recall_score,rf_cv_recall, gb_cv_recall, xg_cv_recall],
'PrecisionBase':[lr_base_cv_precision_score, svc_base_cv_precision,rf_base_cv_precision, gb_base_cv_precision, xgb_base_cv_precision],
'PrecisionTune':[lr_cv_precision_score, svc_cv_precision,rf_cv_precision, gb_cv_precision, xg_cv_precision],
'F1Base':[lr_base_cv_f1,svc_base_cv_f1, rf_base_cv_f1,gb_base_cv_f1, xgb_base_cv_f1],
'F1Tune':[lr_cv_f1_score, svc_cv_f1_score,rf_cv_f1, gb_cv_f1_score, xg_cv_f1],
'AUCBase':[lr_base_cv_auc_score, svc_base_cv_auc_score,rf_base_cv_auc, gb_base_cv_auc, xgb_base_cv_auc],
'AUCTune':[lr_cv_auc_score, svc_cv_auc, rf_cv_auc, gb_cv_auc,xg_cv_auc],
'GridTuneTime':[log_grid_training_time, svc_grid_training_time, rf_grid_training_time, gb_grid_training_time,xgb_grid_training_time],
'CVTime':[log_cross_val_Time, svc_cross_val_Time, rf_cross_val_Time, gb_cross_val_Time, xgb_cross_val_Time],
'CVBaseTime':[log_cross_valBase_Time,svcBase_cross_val_Time, rfBase_cross_val_Time,gbBase_cross_val_Time,xgbBase_cross_val_Time]})




metricTest_comparison=pd.DataFrame({'Model':['Support Vector Machine', 'Random Forest',  'Extreme Gradient Boosting'],
'RecallBase':[svc_base_cv_recall,rf_base_cv_recall, xgb_base_cv_recall],
'RecallTest':[svc_recallBase_test,rfBase_recall_test, xg_recall_test_base],
'PrecisionBase':[svc_base_cv_precision,rf_base_cv_precision, xgb_base_cv_precision],
'PrecisionTest':[svcBase_precision_test,rfBase_precision_test, xg_test_precision_base],
'F1Base':[svc_base_cv_f1, rf_base_cv_f1,xgb_base_cv_f1],
'F1Test':[svc_f1Base_test,rfBase_test_f1, xg_test_f1_base],
'AUCBase':[svc_base_cv_auc_score,rf_base_cv_auc, xgb_base_cv_auc],
'AUCTest':[svc_aucBase_test, rfBase_roc_test,xg_roc_test_base]

})




metric_comparison=metric_comparison.round({'RecallBase':2,'RecallTune':2,'PrecisionBase':2,
'PrecisionTune':2,'F1Base':2,'F1Tune':2,'AUCBase':2, 'AUCTune':2,'GridTuneTime':2, 'CVTime':2, 'CVBaseTime':2})
library(dplyr)
library(gt)
library(gtExtras)
library(ggsci)
library(RColorBrewer)
library(ggplot2)
library(readr)


gt_comp_tbl <- 
  gt(metric_comparison) %>%
  tab_header(
    title = md("**Model Cross Validation Comparison**"),
    subtitle = "Evaluation and Performance Metrics"
  ) %>%
  tab_spanner(
    label = "Metrics",
    columns = c(RecallBase, RecallTune , PrecisionBase,PrecisionTune,F1Base,F1Tune,AUCBase, AUCTune)
  ) %>%
  tab_spanner(
    label = "Time",
    columns = c(GridTuneTime,CVTime, CVBaseTime,CVTimeDiff )
  ) %>% 
  tab_style(
    style=cell_text(size=px(10)),
    locations = cells_column_labels(c(Model,RecallBase, RecallTune , PrecisionBase,PrecisionTune,F1Base,F1Tune,AUCBase, AUCTune,GridTuneTime,CVTime, CVBaseTime, CVTimeDiff)
      )) %>% 
  tab_style(
    style=cell_text(size=px(9.5)),
    locations = cells_body(c(RecallBase, RecallTune , PrecisionBase,PrecisionTune,F1Base,F1Tune,AUCBase, AUCTune,GridTuneTime,CVTime, CVBaseTime, CVTimeDiff))
  ) %>% 
  tab_style(
    style = cell_text(size=px(10)),
    locations=cells_body(Model)
  ) %>% 
  data_color(
    columns=c(PrecisionBase, PrecisionTune),
    method="numeric",
    palette="YlGn",
    domain=c(0.91,0.5)) %>% 
  data_color(
    columns=c(RecallBase, RecallTune, F1Base, F1Tune, AUCBase, AUCTune),
    palette=c("#ffffff","#5A2D81"), domain=c(0.90,0.5)) %>% 
  data_color(
    columns=c(GridTuneTime,CVTime,CVBaseTime, CVTimeDiff),
    palette=c("#ffffff","#FFC72C"), domain=c(231,-59)) %>%
  
  tab_options(table.background.color = "lightcyan") %>% 
   tab_source_note(source_note = md("**Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes)**")) %>% 
  tab_style(
    style=cell_text(size=px(9)),
    locations=cells_source_notes()
  )



Model Cross Validation Comparison
Evaluation and Performance Metrics
Model
Metrics
Time
RecallBase RecallTune PrecisionBase PrecisionTune F1Base F1Tune AUCBase AUCTune GridTuneTime CVTime CVBaseTime CVTimeDiff
Logistic Regression 0.51 0.51 0.65 0.65 0.57 0.57 0.70 0.70 1.46 0.66 1.78 1.12
Support Vector Machine 0.73 0.78 0.89 0.85 0.80 0.81 0.85 0.86 171.50 92.37 33.86 -58.51
Random Forest 0.72 0.73 0.90 0.90 0.80 0.80 0.84 0.85 230.96 21.77 2.66 -19.11
Gradient Boosting 0.55 0.77 0.74 0.87 0.63 0.82 0.74 0.87 135.59 42.59 41.63 -0.96
Extreme Gradient Boosting 0.75 0.76 0.89 0.91 0.81 0.83 0.86 0.86 98.62 47.93 10.76 -37.17
Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes)



The table above presents evaluation metrics for class prediction of 1 (Yes), where the event of interest is a fraudulent submission. While all metrics are considered, Precision is our primary focus. A higher precision score indicates fewer false positives—cases where the model incorrectly flags a legitimate submission as fraudulent. Minimizing these errors is critical, as we do not want to falsely accuse a policyholder of fraud.

Recall, which captures the share of actual fraud cases the model identifies (a low recall means many false negatives: predicting “No” when the actual class is “Yes”), is also important and will be the secondary metric of focus after precision.
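Both metrics fall straight out of the confusion-matrix counts. A small sketch with made-up counts, not this project's numbers:

```python
def precision_recall(tp, fp, fn):
    # Precision: of the claims flagged as fraud, the share that truly were.
    # Recall: of the truly fraudulent claims, the share the model caught.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts only:
p, r = precision_recall(tp=80, fp=20, fn=40)
print(round(p, 2), round(r, 2))  # 0.8 0.67
```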

From the table, we observe that the Support Vector Machine (SVM) model achieves the highest precision score, followed by Random Forest and Extreme Gradient Boost. Interestingly, the base models for SVM and Random Forest performed slightly better (by 0.01) than their tuned counterparts. Similarly, the base model for Extreme Gradient Boost was only 0.01 lower than its tuned version.

This presents two key advantages of using base models:

-No hyperparameter tuning required, which significantly reduces grid search time—avoiding delays ranging from approximately 88 to 140 seconds.

-Faster cross-validation time, with reductions ranging from 17 to 34 seconds.

Our next step is to compare these cross-validation results with those from the test set (unseen data). We will proceed with the three models that showed the best precision and training efficiency: SVM, Random Forest, and Extreme Gradient Boost.



gt_Testcomp_tbl <- 
  gt(metricTest_comparison) %>%
  tab_header(
    title = md("**Model Test/Cross Validation Comparison**"),
    subtitle = "Evaluation Metrics"
  ) %>%
  tab_spanner(
    label = "Metrics",
    columns = c(RecallBase, RecallTest , PrecisionBase,PrecisionTest,F1Base,F1Test,AUCBase, AUCTest)
  ) %>%
  
  
  tab_style(
    style=cell_text(size=px(10)),
    locations = cells_column_labels(c(Model,RecallBase, RecallTest,PrecisionBase,PrecisionTest,F1Base,F1Test,AUCBase, AUCTest)
      )) %>% 
  tab_style(
    style=cell_text(size=px(10)),
    locations=cells_body(c(RecallBase, RecallTest, PrecisionBase,PrecisionTest,F1Base,F1Test, AUCBase, AUCTest))
  ) %>% 
  tab_style(
    style = cell_text(size=px(11)),
    locations=cells_body(Model)
  ) %>% 
  data_color(
    columns=c(PrecisionBase,PrecisionTest),
    method="numeric",
    palette="Blues",
    domain=c(0.91,0.88)) %>% 
  data_color(
    columns=c(RecallBase,  F1Base, AUCBase),
    palette=c("#ffffff","#5A2D81"), domain=c(0.86,0.71)) %>% 
  data_color(
    columns=c( RecallTest, F1Test, AUCTest),
    palette=c("#ffffff","#FFC72C"), domain=c(0.86,0.71)) %>%
  
  tab_options(table.background.color = "lightcyan") %>% 
  tab_source_note(source_note = md("**Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes)**")) %>% 
  tab_style(
    style=cell_text(size=px(9)),
    locations=cells_source_notes()
  ) %>% 
  tab_style(
    style = cell_text(size=px(9)),
    locations = cells_column_spanners()
  )



Model Test/Cross Validation Comparison
Evaluation Metrics
Model
Metrics
Difference in Cross Validation and Test Scores
RecallBase RecallTest PrecisionBase PrecisionTest F1Base F1Test AUCBase AUCTest Recall_diff Precision_diff F1_diff AUC_diff
Support Vector Machine 0.73 0.74 0.89 0.90 0.80 0.81 0.85 0.85 -0.01 -0.01 -0.01 0.00
Random Forest 0.72 0.73 0.90 0.91 0.80 0.81 0.84 0.85 -0.01 -0.01 -0.01 -0.01
Extreme Gradient Boosting 0.75 0.74 0.89 0.89 0.81 0.81 0.86 0.86 0.01 0.00 0.00 0.00
Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes)



The table above compares cross-validation scores with test set performance to evaluate how well our selected models generalize to unseen data. Specifically, we are looking for signs of overfitting or underfitting:

-Overfitting occurs when a model performs well on training data but poorly on new data—often indicated by training scores significantly higher than test scores.

-Underfitting happens when a model doesn’t learn the training data well, which can be suggested when 
test scores are higher than training scores.

Looking at the metrics, we see that both the Support Vector Machine and Random Forest models exhibit a slight underfitting pattern for Precision, with a minor drop of 0.01 between cross-validation and test scores.

Examining other metrics:

-Recall and AUC scores are very consistent across train and test sets.

-Both SVM and Random Forest models show only a 0.01 difference in Recall and AUC, suggesting strong generalization.

While both models perform similarly and generalize well, we note that Random Forest has a slightly higher Precision Test score (0.91 vs. 0.90) and matches SVM in F1 and AUC.

Given the strong overall performance and balance across metrics, Random Forest is selected as the final model for fitting to new data. However, SVM remains a strong alternative and may still be considered in further evaluations.





New Data



Now that we have chosen a final model, we will use it to predict on new data. We will go through the same steps of cleaning, feature engineering, and preparation as with the original data.

The only difference between the new data and the data used for training and testing is that the new data has no labels, meaning that the observations have not been labeled as fraud or no fraud.

We will validate that the columns of our newly imported data match the columns of our originally imported data.




def df_columns_equal(df_original, df_new):
  assert df_original.columns.equals(df_new.columns), "Mismatch between original and new data frame columns"

  print("Original and new data frame columns match")



df_columns_equal(Train_Demographics_p, new_Demographics)
## Original and new data frame columns match


df_columns_equal(Train_Claim_p,new_Claim)
## Original and new data frame columns match


df_columns_equal(Train_Policy_p,new_Policy)
## Original and new data frame columns match



We’ve confirmed that the columns of our new data sets are equal to the columns of our original data sets. We can now proceed to merge our new data sets.






new_fraud=new_Claim.merge(new_Demographics, on="CustomerID")\
.merge(new_Policy, on="CustomerID")
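One defensive refinement worth considering for merges like this one is pandas' `validate` argument, which makes the one-row-per-customer assumption explicit and raises a `MergeError` if a key is duplicated. A sketch on toy frames, not the project's data:

```python
import pandas as pd

# Hypothetical two-row frames keyed by CustomerID.
claims = pd.DataFrame({"CustomerID": ["C1", "C2"], "Claim": [100, 250]})
demographics = pd.DataFrame({"CustomerID": ["C1", "C2"], "Age": [34, 51]})

# validate="one_to_one" raises pandas.errors.MergeError on duplicate keys.
merged = claims.merge(demographics, on="CustomerID", validate="one_to_one")
print(merged.shape)  # (2, 3)
```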



Function to check that the data is a DataFrame




def check_is_dataframe(df):
    assert isinstance(df, pd.DataFrame), "Error: data is not a DataFrame."
    print("Data is a DataFrame")


check_is_dataframe(new_fraud)
## Data is a DataFrame


## Shape new_fraud: (8912, 37)


Our new data frame has 8,912 rows and 37 columns.

## new_fraud data types
##  CustomerID                     object
## DateOfIncident                 object
## TypeOfIncident                 object
## TypeOfCollission               object
## SeverityOfIncident             object
## AuthoritiesContacted           object
## IncidentState                  object
## IncidentCity                   object
## IncidentAddress                object
## IncidentTime                    int32
## NumberOfVehicles                int32
## PropertyDamage                 object
## BodilyInjuries                  int32
## Witnesses                      object
## PoliceReport                   object
## AmountOfTotalClaim             object
## AmountOfInjuryClaim             int32
## AmountOfPropertyClaim           int32
## AmountOfVehicleDamage           int32
## InsuredAge                    float64
## InsuredZipCode                float64
## InsuredGender                  object
## InsuredEducationLevel          object
## InsuredOccupation              object
## InsuredHobbies                 object
## CapitalGains                  float64
## CapitalLoss                   float64
## Country                        object
## InsurancePolicyNumber         float64
## CustomerLoyaltyPeriod         float64
## DateOfPolicyCoverage           object
## InsurancePolicyState           object
## Policy_CombinedSingleLimit     object
## Policy_Deductible             float64
## PolicyAnnualPremium           float64
## UmbrellaLimit                 float64
## InsuredRelationship            object
## dtype: object



Feature Engineering





new_fraud_v2=new_fraud.copy()


Reviewing the data types of the new data, we notice that certain columns that are numeric in our original train/test data have a data type of object. We’ll use a function to transform these columns to a numeric data type.






def convert_object_to_int(df, columns):
    """
    Converts specified object columns to the nullable Int32 dtype, in place.

    Parameters:
    df (pd.DataFrame): The DataFrame containing object columns.
    columns (list): List of column names to convert.
    """
    for col in columns:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int32')
    




convert_object_to_int(new_fraud_v2,['AmountOfTotalClaim'])


## new_fraud_v2 data types
##  CustomerID                     object
## DateOfIncident                 object
## TypeOfIncident                 object
## TypeOfCollission               object
## SeverityOfIncident             object
## AuthoritiesContacted           object
## IncidentState                  object
## IncidentCity                   object
## IncidentAddress                object
## IncidentTime                    int32
## NumberOfVehicles                int32
## PropertyDamage                 object
## BodilyInjuries                  int32
## Witnesses                      object
## PoliceReport                   object
## AmountOfTotalClaim              Int32
## AmountOfInjuryClaim             int32
## AmountOfPropertyClaim           int32
## AmountOfVehicleDamage           int32
## InsuredAge                    float64
## InsuredZipCode                float64
## InsuredGender                  object
## InsuredEducationLevel          object
## InsuredOccupation              object
## InsuredHobbies                 object
## CapitalGains                  float64
## CapitalLoss                   float64
## Country                        object
## InsurancePolicyNumber         float64
## CustomerLoyaltyPeriod         float64
## DateOfPolicyCoverage           object
## InsurancePolicyState           object
## Policy_CombinedSingleLimit     object
## Policy_Deductible             float64
## PolicyAnnualPremium           float64
## UmbrellaLimit                 float64
## InsuredRelationship            object
## dtype: object



We will use our previously created function to transform the date columns to the correct datetime data type.




convert_to_datetime(new_fraud_v2,'DateOfIncident')




convert_to_datetime(new_fraud_v2,'DateOfPolicyCoverage')



check_is_datetime(new_fraud_v2, 'DateOfIncident')
## Feature 'DateOfIncident' is datetime dtype


check_is_datetime(new_fraud_v2, 'DateOfPolicyCoverage')
## Feature 'DateOfPolicyCoverage' is datetime dtype



We have successfully transformed the date columns to the appropriate datetime data type.


We’ll now create new features from the date features.





new_fraud_v2["coverageIncidentDiff"]=(new_fraud_v2["DateOfIncident"]-new_fraud_v2["DateOfPolicyCoverage"])

new_fraud_v2["coverageIncidentDiff"]=new_fraud_v2["coverageIncidentDiff"]/np.timedelta64(1,'Y')


## count    8912.000000
## mean       13.130826
## std         6.591779
## min        -0.032855
## 25%         7.610697
## 50%        13.298014
## 75%        18.804630
## max        25.065539
## Name: coverageIncidentDiff, dtype: float64


The range of coverageIncidentDiff goes from a minimum of -0.032855 to a maximum of 25.07 years. A negative value means the recorded incident date precedes the start of policy coverage.
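Such negative differences can be counted directly as a quick data-quality check. A sketch with made-up dates, not the project's data:

```python
import pandas as pd

# Two hypothetical policies; in the second, coverage starts after the incident.
df = pd.DataFrame({
    "DateOfPolicyCoverage": pd.to_datetime(["2010-01-01", "2020-02-01"]),
    "DateOfIncident": pd.to_datetime(["2020-01-01", "2020-01-20"]),
})
diff_days = (df["DateOfIncident"] - df["DateOfPolicyCoverage"]).dt.days
print(int((diff_days < 0).sum()))  # 1
```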






new_fraud_v2['dayOfWeek'] = new_fraud_v2["DateOfIncident"].dt.day_name()
    


new_fraud_v2['dayOfWeek'].value_counts(normalize=True).round(2)
## Saturday     0.15
## Wednesday    0.15
## Tuesday      0.15
## Friday       0.14
## Thursday     0.14
## Monday       0.14
## Sunday       0.13
## Name: dayOfWeek, dtype: float64


## ******** Unique Number of Vehicles********
## [3 1 2 4]
## ******** Unique Bodily Injuries********
## [0 1 2]


## NumberOfVehicles and BodilyInjuries data type:
## NumberOfVehicles    int32
## BodilyInjuries      int32
## dtype: object


Both BodilyInjuries and NumberOfVehicles have a small number of unique values, yet their data types are int. They would be better as categorical. We’ll apply our previously created function to transform them.
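The convert_to_cat helper is defined earlier in the analysis; the core of the idea is just a dtype cast, sketched here on a toy Series:

```python
import pandas as pd

# A low-cardinality int column becomes categorical with a single cast.
s = pd.Series([1, 2, 3, 2, 1], name="NumberOfVehicles")
s_cat = s.astype("category")
print(s_cat.dtype)  # category
```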





convert_to_cat(new_fraud_v2, 'NumberOfVehicles') 




convert_to_cat(new_fraud_v2, 'BodilyInjuries') 


check_is_categorical(new_fraud_v2,'NumberOfVehicles')
## Feature 'NumberOfVehicles' is categorical dtype


check_is_categorical(new_fraud_v2,'BodilyInjuries')
## Feature 'BodilyInjuries' is categorical dtype



## *************Incident Time Unique Values*************
## [ 4 16 20 10  7 22  6 14 15 19 12 17 18  5 13 11 23  8 21  9  3  2  1  0
##  -5]






time_day={
  5:'early morning', 6:'early morning', 7:'early morning', 8:'early morning',
  9:'late morning', 10:'late morning', 11:'late morning',
  12:'early afternoon', 13:'early afternoon', 14:'early afternoon', 15:'early afternoon',
  16:'late afternoon', 17:'late afternoon',
  18:'evening', 19:'evening',
  20:'night', 21:'night', 22:'night', 23:'night', 24:'night',
  1:'night', 2:'night', 3:'night', 4:'night'
}
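Note that the unique IncidentTime values shown above include 0 and -5, neither of which appears as a key in time_day; `Series.map` returns NaN for unmapped keys, which is what later surfaces as missing IncidentPeriodDay values. A toy sketch of that behavior, using an abbreviated hypothetical mapping rather than the full dictionary:

```python
import pandas as pd

time_day_demo = {5: "early morning", 23: "night"}  # abbreviated, hypothetical
hours = pd.Series([5, 23, 0, -5])
mapped = hours.map(time_day_demo)  # keys 0 and -5 are unmapped -> NaN
print(int(mapped.isna().sum()))  # 2
```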





new_fraud_v2['IncidentPeriodDay']=new_fraud_v2['IncidentTime'].map(time_day)


## ***Incident Period Day Value Counts***
## night              2333
## early morning      1765
## early afternoon    1732
## late morning       1149
## late afternoon     1018
## evening             808
## Name: IncidentPeriodDay, dtype: int64



## Data frame includes datatypes object True





new_fraud_v3=new_fraud_v2.copy()



new_fraud_v3=new_fraud_v3.drop(['DateOfIncident', 'DateOfPolicyCoverage', 'IncidentTime'], axis=1)


As with our original data set, we will convert object data types to categorical.






convert_cats(new_fraud_v3)


check_no_object_dtype(new_fraud_v3)
## ✅ No object dtype columns found in the DataFrame.



gs=plt.GridSpec(1, 3)
fig=plt.figure(figsize=(10,8))
fig.suptitle('Categorical Counts-1', fontsize=8)


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])

#plt.title('Type of Incident',fontsize=7, y=1)
hg=sns.countplot(data = new_fraud_v3, x = 'TypeOfIncident', ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Type of Incident", fontsize=5) 
hg.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
sp=sns.countplot(data=new_fraud_v3, x='TypeOfCollission', ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=5)
sp.set_xlabel("Type of Collision", fontsize=5) 
sp.set_ylabel("Count",fontsize=4) 
#plt.title('Reported Fraud',fontsize=7, y=1)
bp=sns.countplot(data=new_fraud_v3, x='SeverityOfIncident', ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("SeverityOfIncident", fontsize=5) 
bp.set_ylabel("Count", fontsize=5) 

plt.tight_layout()

plt.show()

plt.clf()





my_tab=pd.crosstab(index=new_fraud_v3["TypeOfIncident"], columns=new_fraud_v3["TypeOfCollission"], normalize=True).round(2)


fig = plt.figure(figsize=(13, 10))

sns.heatmap(my_tab, cmap="BuGn",cbar=False, annot=True,linewidth=0.3)

plt.yticks(rotation=0)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0, 0.5, 'Multi-vehicle Collision'), Text(0, 1.5, 'Parked Car'), Text(0, 2.5, 'Single Vehicle Collision'), Text(0, 3.5, 'Vehicle Theft')])
plt.xticks(rotation=60)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0.5, 0, '?'), Text(1.5, 0, 'Front Collision'), Text(2.5, 0, 'Rear Collision'), Text(3.5, 0, 'Side Collision')])
plt.title('Type of Incident vs Type of Collision', fontsize=20)
plt.xlabel('TypeOfCollision', fontsize=15)
plt.ylabel('TypeOfIncident', fontsize=15)

plt.show()

plt.clf()




new_fraud_v4=new_fraud_v3.copy()



We observe from the cross table that the ‘?’ (unknown) type of collision is only associated with a small number of incident types related to collisions. These data points will be retained by replacing the ‘?’ value with ‘None’.
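Since TypeOfCollission is stored as a categorical column, one alternative worth knowing is renaming the category itself, which keeps the category dtype intact. A sketch on toy data, not the project's frame:

```python
import pandas as pd

s = pd.Series(["?", "Front Collision", "Rear Collision"], dtype="category")
s = s.cat.rename_categories({"?": "None"})  # '?' category becomes 'None'
print(sorted(s.cat.categories))
```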




new_fraud_v4['TypeOfCollission'] = new_fraud_v4['TypeOfCollission'].replace(['?'], 'None')



plt.figure(figsize=(16,10))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=new_fraud_v4, x='TypeOfCollission')
#plt.tick_params(label_rotation=45)
ax.tick_params(axis='both', which='major', labelsize=11)
ax.set_title("Type of Collision-Changed", size=22)
ax.set(xlabel=None)
ax.set(ylabel=None)
sns.set_style("dark")

ax.annotate('Figure ##',

            xy = (1.0, -0.2),

            xycoords='axes fraction',

            ha='right',

            va="center",

            fontsize=10)
            
fig.tight_layout()

plt.show()

plt.clf()

gs=plt.GridSpec(2, 3)
fig=plt.figure(figsize=(11,6))
fig.suptitle('Categorical Counts-2', fontsize=8)


ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
ax4=fig.add_subplot(gs[1, 0])
ax5=fig.add_subplot(gs[1, 1])
ax6=fig.add_subplot(gs[1,2])

#plt.title('Type of Incident',fontsize=7, y=1)
c1=sns.countplot(data = new_fraud_v4, x = 'Witnesses', ax=ax1)
c1.tick_params(axis='both', which='major', labelsize=4)
c1.set_xlabel('Witnesses', fontsize=5) 
c1.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
c2=sns.countplot(data=new_fraud_v4, x='BodilyInjuries', ax=ax2)
c2.tick_params(axis='both', which='major', labelsize=5)
c2.set_xlabel("Bodily Injuries", fontsize=5) 
c2.set_ylabel("Count",fontsize=4) 
#plt.title('Reported Fraud',fontsize=7, y=1)
c3=sns.countplot(data=new_fraud_v4, x='PropertyDamage', ax=ax3)
c3.tick_params(axis='both', which='major', labelsize=5)
c3.set_xlabel("Property Damage", fontsize=5) 
c3.set_ylabel("Count", fontsize=5) 
c4=sns.countplot(data = new_fraud_v4, x = 'NumberOfVehicles', ax=ax4)
c4.tick_params(axis='both', which='major', labelsize=4)
c4.set_xlabel("Number Of Vehicles", fontsize=5) 
c4.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
c5=sns.countplot(data=new_fraud_v4, x='IncidentState', ax=ax5)
c5.tick_params(axis='both', which='major', labelsize=5)
c5.set_xlabel("Incident State", fontsize=5) 
c5.set_ylabel("Count",fontsize=4) 
#plt.title('Reported Fraud',fontsize=7, y=1)
c6=sns.countplot(data=new_fraud_v4, x='AuthoritiesContacted', ax=ax6)
c6.tick_params(axis='both', which='major', labelsize=5)
c6.set_xlabel("Authorities Contacted", fontsize=5) 
c6.set_ylabel("Count", fontsize=5)

plt.tight_layout()

plt.show()

plt.clf()



new_fraud_v5=new_fraud_v4.copy()



new_fraud_v5['Witnesses']=new_fraud_v5['Witnesses'].cat.remove_categories("MISSINGVALUE")



new_fraud_v5=new_fraud_v5.drop(['PropertyDamage'], axis=1)



plt.figure(figsize=(14,8))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=new_fraud_v5, x='Witnesses')
#plt.tick_params(label_rotation=45)
ax.set_title("Witnesses-Changed", size=20)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='both', which='major', labelsize=14)
sns.set_style("dark")

ax.annotate('Figure ##',

            xy = (1.0, -0.2),

            xycoords='axes fraction',

            ha='right',

            va="center",

            fontsize=10)
            
fig.tight_layout()

plt.show()

plt.clf()

new_fraud_v5.isna().sum() > 0
## CustomerID                    False
## TypeOfIncident                False
## TypeOfCollission              False
## SeverityOfIncident            False
## AuthoritiesContacted          False
## IncidentState                 False
## IncidentCity                  False
## IncidentAddress               False
## NumberOfVehicles              False
## BodilyInjuries                False
## Witnesses                      True
## PoliceReport                  False
## AmountOfTotalClaim             True
## AmountOfInjuryClaim           False
## AmountOfPropertyClaim         False
## AmountOfVehicleDamage         False
## InsuredAge                    False
## InsuredZipCode                False
## InsuredGender                  True
## InsuredEducationLevel         False
## InsuredOccupation             False
## InsuredHobbies                False
## CapitalGains                  False
## CapitalLoss                   False
## Country                        True
## InsurancePolicyNumber         False
## CustomerLoyaltyPeriod         False
## InsurancePolicyState          False
## Policy_CombinedSingleLimit    False
## Policy_Deductible             False
## PolicyAnnualPremium           False
## UmbrellaLimit                 False
## InsuredRelationship           False
## coverageIncidentDiff          False
## dayOfWeek                     False
## IncidentPeriodDay              True
## dtype: bool


From the above output we find there are features containing null (missing) values. Before we remove any missing values, we’ll drop features that will not be used in our models.





new_fraud_v6=new_fraud_v5.copy()





new_fraud_v6=new_fraud_v6.drop(['CustomerID', 'IncidentAddress', 'InsuredZipCode', 'InsuredHobbies','Country', 'InsurancePolicyNumber', 'IncidentCity','AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 
  'InsuredEducationLevel','InsuredOccupation','PoliceReport'], axis=1)


new_fraud_v6.isna().sum()
## TypeOfIncident                  0
## TypeOfCollission                0
## SeverityOfIncident              0
## AuthoritiesContacted            0
## IncidentState                   0
## NumberOfVehicles                0
## BodilyInjuries                  0
## Witnesses                      12
## AmountOfTotalClaim              8
## InsuredAge                      0
## InsuredGender                   8
## CapitalGains                    0
## CapitalLoss                     0
## CustomerLoyaltyPeriod           0
## InsurancePolicyState            0
## Policy_CombinedSingleLimit      0
## Policy_Deductible               0
## PolicyAnnualPremium             0
## UmbrellaLimit                   0
## InsuredRelationship             0
## coverageIncidentDiff            0
## dayOfWeek                       0
## IncidentPeriodDay             107
## dtype: int64




new_fraud_v7=new_fraud_v6.copy()



new_fraud_v7=new_fraud_v7.dropna()


new_fraud_v7.isna().sum()
## TypeOfIncident                0
## TypeOfCollission              0
## SeverityOfIncident            0
## AuthoritiesContacted          0
## IncidentState                 0
## NumberOfVehicles              0
## BodilyInjuries                0
## Witnesses                     0
## AmountOfTotalClaim            0
## InsuredAge                    0
## InsuredGender                 0
## CapitalGains                  0
## CapitalLoss                   0
## CustomerLoyaltyPeriod         0
## InsurancePolicyState          0
## Policy_CombinedSingleLimit    0
## Policy_Deductible             0
## PolicyAnnualPremium           0
## UmbrellaLimit                 0
## InsuredRelationship           0
## coverageIncidentDiff          0
## dayOfWeek                     0
## IncidentPeriodDay             0
## dtype: int64


All null values have been removed from our data.
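Dropping rows discards information (here, more than 100 observations). As a hedged aside not used in this analysis, those gaps could instead be imputed, for example with scikit-learn's SimpleImputer. The toy frame below merely stands in for new_fraud_v6; the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for new_fraud_v6 (column names illustrative)
df = pd.DataFrame({
    "Witnesses": [1.0, np.nan, 2.0, 1.0],
    "InsuredGender": ["M", "F", np.nan, "F"],
})

# Median for numeric gaps, most frequent level for categorical gaps
df["Witnesses"] = SimpleImputer(strategy="median").fit_transform(df[["Witnesses"]]).ravel()
df["InsuredGender"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["InsuredGender"]]).ravel()

print(df.isna().sum().sum())  # 0 -- no missing values remain
```

Whether imputation or row removal is preferable depends on how many rows are affected and whether the missingness is random; with only ~1% of rows missing, dropping them, as done here, is a defensible choice.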


new_fraud_v7[new_fraud_v7['PolicyAnnualPremium']==-1].shape
## (47, 23)


A value of -1 in the PolicyAnnualPremium feature represents a missing value, so we remove those rows.





new_fraud_v8=new_fraud_v7[new_fraud_v7['PolicyAnnualPremium']!=-1]


new_fraud_v8[new_fraud_v8['PolicyAnnualPremium']==-1].shape
## (0, 23)


The PolicyAnnualPremium feature now contains no missing values (encoded as -1).


Data Preparation



The first step is to confirm that our new data new_fraud_v8 has the same columns as our training data. We'll then check that the levels of the categorical columns in both data frames are equal.







new_data=new_fraud_v8.copy()



print("X_train and new_data columns are equal:",X_train.columns.equals(new_data.columns))
## X_train and new_data columns are equal: True




categorical, numerical=define_columns(new_data)


## Categorical Features:  ['TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'InsuredGender', 'InsurancePolicyState', 'Policy_CombinedSingleLimit', 'InsuredRelationship', 'dayOfWeek', 'IncidentPeriodDay']


## numerical Features:  ['AmountOfTotalClaim', 'InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'coverageIncidentDiff']



Let’s check that the levels from the new data new_fraud_v8 are the same as those categorical levels from the X_train data.






def assert_categorical_levels_match(df1, df2, categorical_columns):
    """
    Checks if the unique values (levels) of categorical columns in two DataFrames match.

    Parameters:
    df1 (pd.DataFrame): First DataFrame
    df2 (pd.DataFrame): Second DataFrame
    categorical_columns (list): List of categorical column names to compare

    Raises:
    AssertionError: If any categorical column has mismatched levels between the two DataFrames.
    """
    for col in categorical_columns:
        levels_df1 = set(df1[col].unique())
        levels_df2 = set(df2[col].unique())
        
        assert levels_df1 == levels_df2, f"Mismatch in column '{col}': {levels_df1 ^ levels_df2}"

    print("All categorical column levels match.")


assert_categorical_levels_match(X_train, new_data,categorical)
## All categorical column levels match.



Now that we have confirmed the categorical levels of both data frames match, we can transform them in preparation for fitting our model.




X_train_tr, X_new=transform_x_columns_tr(X_train, new_data)



## First Five Rows of First Three Columns
##     num__AmountOfTotalClaim  num__InsuredAge  num__CapitalGains
## 0                 0.639019        -1.489358           1.199641
## 1                 0.121226         0.141468           1.210473
## 2                 0.289220         0.016020           0.260888
## 3                -1.870518        -0.109428           1.636523
## 4                -0.694800        -1.238462           0.430586



Test Model








import time
import unittest
from io import StringIO



model=xgb_base_clf




X_test=X_new.copy()


Our next step is to test the XGBoost classifier on our new data to ensure that it returns the expected class labels [0, 1] and probabilities between 0 and 1.





test_results = []


# Define a test class with CSV logging
class TestModelInference(unittest.TestCase):
    def setUp(self):
        self.model = model
        self.X_test = X_test

    def test_prediction_output_values(self):
        """Test that model predictions contain only valid class labels."""
        start_time = time.time()
        pred = self.model.predict(self.X_test)
        unique_values = np.unique(pred)
        for value in unique_values:
            self.assertIn(value, [0, 1])
        elapsed_time = time.time() - start_time
        test_results.append(["Prediction Output Values", "Pass", round(elapsed_time, 4)])

    def test_prediction_probabilities(self):
        """Test that the model returns valid probability values between 0 and 1."""
        start_time = time.time()
        prob_pred = self.model.predict_proba(self.X_test)
        self.assertTrue(np.all((prob_pred >= 0) & (prob_pred <= 1)), "Probabilities must be between 0 and 1")
        self.assertTrue(np.allclose(prob_pred.sum(axis=1), 1, atol=1e-6), "Sum of probabilities must be close to 1")
        elapsed_time = time.time() - start_time
        test_results.append(["Prediction Probabilities", "Pass", round(elapsed_time, 4)])
    def test_prediction_time(self):
        """Test that the model predicts within an acceptable time limit."""
        start_time = time.time()
        _ = self.model.predict(self.X_test)
        elapsed_time = time.time() - start_time
        self.assertLess(elapsed_time, 1, f"Prediction took too long: {elapsed_time:.4f} seconds")
        test_results.append(["Prediction Time", "Pass" if elapsed_time < 1 else "Fail", round(elapsed_time, 4)])


# Run tests and capture results
if __name__ == "__main__":
    print("\n===== Running Model Tests =====\n")

    # Redirect unittest output to a buffer
    test_buffer = StringIO()
    runner = unittest.TextTestRunner(stream=test_buffer, verbosity=2)
    unittest.main(argv=['first-arg-is-ignored'], exit=False, testRunner=runner)

    # Convert test results to a DataFrame
    df_results = pd.DataFrame(test_results, columns=["Test Name", "Status", "Execution Time (s)"])

    # Save results to CSV
    df_results.to_csv("test_results.csv", index=False)

    print("\nTest results saved to test_results.csv")
## 
## ===== Running Model Tests =====
## 
## <unittest.main.TestProgram object at 0x3607c8130>
## 
## Test results saved to test_results.csv
Model Test Results

Test Name                  Status  Execution Time (s)
Prediction Output Values   Pass    0.0041
Prediction Probabilities   Pass    0.0030
Prediction Time            Pass    0.0037




Our classifier passed the tests. Now we will use the classifier to predict on the new data.



Predict and Evaluate






xg_predictions=xgb_base_clf.predict(X_new)






xg_results_df = new_data.copy()
xg_results_df["Predicted_Label"] = xg_predictions


Let’s compare the predicted labels (fraud) for the new data to the reported fraud in our original data.


## Predicted Fraud Percentages from New Data:
## 0    0.85
## 1    0.15
## Name: Predicted_Label, dtype: float64


## Reported Fraud Percentages from Original Data:
## N    0.73
## Y    0.27
## Name: ReportedFraud, dtype: float64
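The percentages above come from normalized value counts. A minimal sketch, where the two Series below are stand-ins for xg_results_df["Predicted_Label"] and the original data's ReportedFraud column:

```python
import pandas as pd

# Stand-in data: in the analysis these are xg_results_df["Predicted_Label"]
# and the original fraud data's "ReportedFraud" column.
predicted = pd.Series([0, 0, 0, 0, 0, 1, 0, 1, 0, 0])
reported = pd.Series(["N", "Y", "N", "N", "N", "Y", "N", "Y", "N", "N"])

pred_pct = predicted.value_counts(normalize=True).round(2)  # share of each label
rep_pct = reported.value_counts(normalize=True).round(2)
print("Predicted Fraud Percentages:\n", pred_pct)
print("Reported Fraud Percentages:\n", rep_pct)
```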


The results show that our model predicted 15% of the new observations as fraud, compared to the 27% reported as fraud in the original data, a difference of 12 percentage points. Let’s check the prediction results on our original test data.




xg_test_predictions=xgb_base_clf.predict(X_test_tr)

xg_test_results_df = X_test_tr.copy()


xg_test_results_df["Predicted_Label"] = xg_test_predictions



test_count=xg_test_results_df["Predicted_Label"].value_counts(normalize=True).round(2)
## Reported Fraud Percentages from Original Data:
## 0    0.77
## 1    0.23
## Name: Predicted_Label, dtype: float64


The predicted labels of the test data are closer to the original data than those from the new data.


Let’s take a look at the predicted probabilities for both the new and test data.





xg_probs_new=xgb_base_clf.predict_proba(X_new)




xg_probs_df = pd.DataFrame(xg_probs_new, columns=['fraud_no', 'fraud_yes'])


## First Five Rows of xg_probs_df
##     fraud_no  fraud_yes
## 0  0.990794   0.009206
## 1  0.877024   0.122976
## 2  0.873884   0.126116
## 3  0.909902   0.090098
## 4  0.926205   0.073795




xg_probs_test=xgb_base_clf.predict_proba(X_test_tr)



xg_test_probs_df = pd.DataFrame(xg_probs_test, columns=['fraud_no', 'fraud_yes'])


## First Five Rows of xg_test_probs_df
##     fraud_no  fraud_yes
## 0  0.149526   0.850474
## 1  0.932576   0.067424
## 2  0.027026   0.972974
## 3  0.922639   0.077361
## 4  0.956429   0.043571


xg_probs_df["fraud_yes"].hist()
plt.title("Distribution Predicted Fraud Probabilities on New Data")
plt.show()

plt.clf()

xg_test_probs_df["fraud_yes"].hist()
plt.title("Distribution Predicted Fraud Probabilities on Test Data")
plt.show()

plt.clf()

The distributions appear similar. We are interested in predicted probabilities of 50% or greater. The XGBoost classifier appears marginally stronger at predicting reported fraud of yes on the original test data at probabilities of 70% or higher. We'll filter our data to observe only predicted fraud probabilities over 70%.
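Beyond eyeballing the histograms, a two-sample Kolmogorov-Smirnov test can quantify how similar the two probability distributions are. This is a hedged aside not used in the original analysis, assuming SciPy is available; the arrays below are stand-ins for the fraud_yes columns of xg_probs_df and xg_test_probs_df.

```python
import numpy as np
from scipy.stats import ks_2samp

# Stand-ins for xg_probs_df["fraud_yes"] and xg_test_probs_df["fraud_yes"]
rng = np.random.default_rng(42)
probs_new = rng.beta(1, 6, size=500)   # right-skewed, mostly low probabilities
probs_test = rng.beta(1, 6, size=500)

stat, pvalue = ks_2samp(probs_new, probs_test)
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")
# A small statistic and a large p-value are consistent with similar distributions.
```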











xg_probs_ovr_sventyPct=xg_probs_df.copy()

xg_probs_ovr_sventyPct=xg_probs_ovr_sventyPct[xg_probs_ovr_sventyPct.fraud_yes >= 0.7]

new_probs_PctOvrSvnty=len(xg_probs_ovr_sventyPct)/len(xg_probs_df)*100


xg_test_probs_ovr_svntyPct=xg_test_probs_df.copy()

xg_test_probs_ovr_svntyPct=xg_test_probs_ovr_svntyPct[xg_test_probs_ovr_svntyPct.fraud_yes >=0.7]



test_probs_PctOvrSvnty=len(xg_test_probs_ovr_svntyPct) / len(xg_test_probs_df)*100
## Among cases correctly predicted as fraud (Yes), 12.30% of the predictions in the new data had
##  a predicted probability greater than 70%, compared to 18.14% in the test data.



It appears the XGBoost classifier predicted a higher percentage of fraud on the original test set than on our new data. Assessment of the XGBoost classifier would benefit from collecting additional data before a definitive judgement can be made.